[2026-03-25 14:24:41,852][mllm.models.large_language_model_local][INFO] - Initializing adapter 'agent_adapter': no initial weights provided or found; starting from scratch. [2026-03-25 14:24:42,661][mllm.models.adapter_training_wrapper][INFO] - Adapter 'agent_adapter': initialized with fresh weights (no initial weights found). [2026-03-25 14:24:42,668][mllm.models.large_language_model_local][INFO] - Initializing adapter 'critic_adapter': no initial weights provided or found; starting from scratch. [2026-03-25 14:24:43,501][mllm.models.adapter_training_wrapper][INFO] - Adapter 'critic_adapter': initialized with fresh weights (no initial weights found). [2026-03-25 14:27:31,785][__main__][INFO] - Starting iteration 0. [2026-03-25 14:27:31,792][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2026-03-25 14:27:31,793][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:27:37,712][__main__][INFO] - Number of regex retries in iteration 0: 0 [2026-03-25 14:27:37,713][__main__][INFO] - agents played in iteration 0 are Bob, Alice [2026-03-25 14:27:38,179][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 37.33%, Block Peak % of device VRAM: 18.62%, ΔTime: 00:00:00 [2026-03-25 14:27:38,248][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 37.33%, Block Peak % of device VRAM: 18.62%, ΔTime: 00:00:00 [2026-03-25 14:27:38,248][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:27:38,249][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2026-03-25 14:27:38,839][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:27:39,975][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:27:40,666][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:27:41,354][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:27:42,041][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:27:42,730][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:27:43,418][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:27:44,106][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:27:44,796][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:27:45,485][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:27:46,174][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 14:27:46,862][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:27:47,552][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:27:48,242][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:27:48,931][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:27:49,621][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:27:50,309][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:27:51,000][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:27:51,690][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:27:52,382][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:27:53,071][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 256 [2026-03-25 14:27:53,762][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:27:54,454][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:27:55,146][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:27:55,838][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:27:56,531][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:27:57,222][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:27:57,917][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:27:58,610][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:27:59,302][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:27:59,997][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:28:00,690][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 14:28:01,383][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:28:02,078][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:28:02,773][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:28:03,469][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:28:04,164][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:28:04,857][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:28:05,551][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:28:06,246][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:28:06,940][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 
14:28:07,635][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:28:08,592][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:28:09,289][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:28:09,985][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:28:10,680][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:28:11,377][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:28:12,072][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:28:12,767][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:28:13,464][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:28:14,159][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:28:14,855][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 14:28:15,550][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:28:16,246][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:28:16,941][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:28:17,636][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:28:18,332][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:28:19,030][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:28:19,726][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:28:20,424][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:28:21,120][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:28:21,818][mllm.training.trainer_common][INFO] - 
Processing mini-batch 244 of 256 [2026-03-25 14:28:22,514][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:28:23,214][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:28:23,911][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 14:28:24,708][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 4.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.08%, ΔTime: 00:00:45 [2026-03-25 14:28:26,222][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:28:26,227][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:28:26,230][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:28:28,139][__main__][INFO] - Iteration 1 took 56s (10.51% Gen, 86.10% Train). Generation: 5s, Training: 48s. Estimated remaining time: 15h 35m 14s. Estimated total time: 15h 39m 8s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 54s, 500 more iterations: 7h 49m 34s. [2026-03-25 14:28:28,143][__main__][INFO] - Starting iteration 1. [2026-03-25 14:28:28,149][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:28:28,150][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:28:34,420][__main__][INFO] - Number of regex retries in iteration 1: 0 [2026-03-25 14:28:34,421][__main__][INFO] - agents played in iteration 1 are Bob, Alice [2026-03-25 14:28:34,861][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:28:34,926][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:28:34,927][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:28:34,928][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:28:35,593][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:28:36,231][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:28:36,936][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:28:37,639][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:28:38,344][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:28:39,049][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:28:39,754][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:28:40,458][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:28:41,163][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:28:41,869][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:28:42,575][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:28:43,281][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:28:43,988][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:28:44,695][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:28:45,402][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:28:46,110][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:28:46,822][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:28:47,530][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:28:48,239][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:28:48,947][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:28:49,655][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:28:50,361][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:28:51,068][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:28:51,775][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:28:52,481][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:28:53,189][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:28:53,895][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:28:54,602][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:28:55,311][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:28:56,020][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:28:56,731][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:28:57,440][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:28:58,147][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:28:58,857][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:28:59,564][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:29:00,272][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:29:00,981][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:29:01,691][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:29:02,399][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:29:03,107][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:29:03,815][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:29:04,525][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:29:05,234][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:29:05,945][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:29:06,656][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:29:07,364][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:29:08,077][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:29:08,788][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:29:09,829][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:29:10,541][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:29:11,253][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:29:11,965][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:29:12,676][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:29:13,389][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:29:14,100][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:29:14,810][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:29:15,523][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:29:16,236][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:29:16,946][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:29:17,656][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:29:18,368][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:29:19,080][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:29:19,792][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:29:20,504][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:29:21,215][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:29:21,920][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 14:29:23,076][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:29:23,083][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:29:23,085][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:29:24,416][__main__][INFO] - Iteration 2 took 56s (11.14% Gen, 86.48% Train). Generation: 6s, Training: 48s. Estimated remaining time: 15h 32m 59s. Estimated total time: 15h 37m 49s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 46s, 500 more iterations: 7h 48m 54s. [2026-03-25 14:29:24,421][__main__][INFO] - Starting iteration 2. [2026-03-25 14:29:24,425][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:29:24,426][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:29:29,690][__main__][INFO] - Number of regex retries in iteration 2: 0 [2026-03-25 14:29:29,692][__main__][INFO] - agents played in iteration 2 are Bob, Alice [2026-03-25 14:29:30,155][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:29:30,224][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:29:30,224][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:29:30,225][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:29:30,908][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:29:31,553][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:29:32,265][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:29:32,976][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:29:33,687][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:29:34,398][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:29:35,110][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:29:35,822][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:29:36,533][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:29:37,246][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:29:37,958][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:29:38,672][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:29:39,386][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:29:40,101][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:29:40,811][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:29:41,523][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:29:42,237][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:29:42,949][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:29:43,661][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:29:44,374][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:29:45,088][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:29:45,799][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:29:46,513][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:29:47,226][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:29:47,938][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:29:48,652][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:29:49,365][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:29:50,077][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:29:50,792][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:29:51,505][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:29:52,218][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:29:52,934][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:29:53,648][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:29:54,366][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:29:55,078][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:29:55,795][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:29:56,507][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:29:57,221][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:29:57,937][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:29:58,651][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:29:59,368][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:30:00,082][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:30:00,799][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:30:01,515][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:30:02,231][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:30:02,945][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:30:03,663][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:30:04,379][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:30:05,421][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:30:06,137][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:30:06,852][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:30:07,568][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:30:08,284][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:30:09,000][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:30:09,717][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:30:10,432][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:30:11,150][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:30:11,865][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:30:12,582][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:30:13,298][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:30:14,014][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:30:14,730][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:30:15,447][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:30:16,162][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:30:16,878][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:30:17,604][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 14:30:18,979][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:30:18,982][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:30:18,983][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:30:20,611][__main__][INFO] - Iteration 3 took 56s (9.37% Gen, 87.73% Train). Generation: 5s, Training: 49s. Estimated remaining time: 15h 30m 41s. Estimated total time: 15h 36m 27s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 38s, 500 more iterations: 7h 48m 13s. [2026-03-25 14:30:20,614][__main__][INFO] - Starting iteration 3. [2026-03-25 14:30:20,620][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:30:20,621][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:30:29,841][__main__][INFO] - Number of regex retries in iteration 3: 0 [2026-03-25 14:30:29,842][__main__][INFO] - agents played in iteration 3 are Bob, Alice [2026-03-25 14:30:30,296][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:30:30,363][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:30:30,364][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:30:30,365][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:30:31,030][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:30:31,675][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:30:32,386][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:30:33,100][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:30:33,815][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:30:34,527][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:30:35,239][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:30:35,950][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:30:36,664][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:30:37,376][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:30:38,090][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:30:38,804][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:30:39,520][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:30:40,233][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:30:40,948][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:30:41,664][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:30:42,380][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:30:43,099][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:30:43,812][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:30:44,529][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:30:45,243][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:30:45,959][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:30:46,674][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:30:47,390][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:30:48,106][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:30:48,820][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:30:49,536][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:30:50,250][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:30:50,968][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:30:51,682][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:30:52,399][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:30:53,114][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:30:53,831][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:30:54,547][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:30:55,266][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:30:55,983][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:30:56,703][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:30:57,419][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:30:58,136][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:30:58,855][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:30:59,573][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:31:00,293][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:31:01,009][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:31:01,726][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:31:02,441][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:31:03,156][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:31:03,871][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:31:04,588][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:31:05,618][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:31:06,334][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:31:07,049][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:31:07,766][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:31:08,483][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:31:09,200][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:31:09,917][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:31:10,634][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:31:11,349][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:31:12,067][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:31:12,782][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:31:13,500][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:31:14,216][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:31:14,934][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:31:15,649][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:31:16,368][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:31:17,084][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:31:17,795][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 14:31:18,860][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:31:18,863][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:31:18,864][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:31:20,263][__main__][INFO] - Iteration 4 took 59s (15.46% Gen, 82.19% Train). Generation: 9s, Training: 49s. Estimated remaining time: 16h 27m 19s. Estimated total time: 16h 34m 5s. Time estimates for 10 more iterations: 9m 56s, 100 more iterations: 1h 39m 24s, 500 more iterations: 8h 17m 2s. [2026-03-25 14:31:20,266][__main__][INFO] - Starting iteration 4. [2026-03-25 14:31:20,269][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:31:20,270][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:31:25,526][__main__][INFO] - Number of regex retries in iteration 4: 0 [2026-03-25 14:31:25,527][__main__][INFO] - agents played in iteration 4 are Bob, Alice [2026-03-25 14:31:25,971][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:31:26,038][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:31:26,039][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:31:26,040][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:31:26,699][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:31:27,344][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:31:28,062][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:31:28,779][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:31:29,495][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:31:30,210][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:31:30,927][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:31:31,642][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:31:32,358][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:31:33,077][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:31:33,795][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:31:34,510][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:31:35,229][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:31:35,946][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:31:36,664][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:31:37,385][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:31:38,100][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:31:38,816][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:31:39,534][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:31:40,251][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:31:40,967][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:31:41,682][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:31:42,400][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:31:43,115][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:31:43,832][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:31:44,547][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:31:45,266][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:31:45,981][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:31:46,699][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:31:47,416][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:31:48,133][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:31:48,851][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:31:49,567][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:31:50,286][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:31:51,004][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:31:51,722][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:31:52,438][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:31:53,157][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:31:53,873][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:31:54,591][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:31:55,309][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:31:56,027][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:31:56,745][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:31:57,462][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:31:58,182][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:31:58,899][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:31:59,617][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:32:00,336][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:32:01,282][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:32:02,000][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:32:02,716][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:32:03,435][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:32:04,152][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:32:04,872][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:32:05,589][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:32:06,309][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:32:07,025][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:32:07,742][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:32:08,462][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:32:09,179][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:32:09,899][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:32:10,618][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:32:11,336][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:32:12,056][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:32:12,776][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:32:13,542][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 14:32:14,651][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:32:14,655][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:32:14,656][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:32:15,958][__main__][INFO] - Iteration 5 took 55s (9.44% Gen, 88.22% Train). Generation: 5s, Training: 49s. Estimated remaining time: 15h 20m 28s. Estimated total time: 15h 28m 10s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 49s, 500 more iterations: 7h 44m 5s. [2026-03-25 14:32:15,960][__main__][INFO] - Starting iteration 5. [2026-03-25 14:32:15,964][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:32:15,965][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:32:21,320][__main__][INFO] - Number of regex retries in iteration 5: 0 [2026-03-25 14:32:21,321][__main__][INFO] - agents played in iteration 5 are Bob, Alice [2026-03-25 14:32:21,778][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:32:21,844][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:32:21,844][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:32:21,845][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:32:22,552][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:32:23,200][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:32:23,920][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:32:24,637][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:32:25,356][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:32:26,073][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:32:26,791][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:32:27,508][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:32:28,227][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:32:28,943][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:32:29,661][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:32:30,380][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:32:31,098][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:32:31,816][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:32:32,534][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:32:33,252][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:32:33,971][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:32:34,687][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:32:35,406][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:32:36,121][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:32:36,839][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:32:37,555][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:32:38,275][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:32:38,995][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:32:39,711][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:32:40,429][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:32:41,147][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:32:41,865][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:32:42,580][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:32:43,299][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:32:44,015][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:32:44,735][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:32:45,454][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:32:46,173][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:32:46,892][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:32:47,611][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:32:48,329][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:32:49,046][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:32:49,767][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:32:50,484][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:32:51,203][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:32:51,920][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:32:52,640][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:32:53,358][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:32:54,077][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:32:54,795][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:32:55,515][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:32:56,236][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:32:57,175][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:32:57,896][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:32:58,614][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:32:59,332][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:33:00,052][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:33:00,769][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:33:01,489][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:33:02,208][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:33:02,926][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:33:03,646][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:33:04,364][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:33:05,083][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:33:05,803][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:33:06,521][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:33:07,240][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:33:07,960][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:33:08,679][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:33:09,372][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 14:33:10,637][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:33:10,642][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:33:10,644][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:33:12,106][__main__][INFO] - Iteration 6 took 56s (9.54% Gen, 87.85% Train). Generation: 5s, Training: 49s. Estimated remaining time: 15h 27m 5s. Estimated total time: 15h 35m 43s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 34s, 500 more iterations: 7h 47m 51s. [2026-03-25 14:33:12,109][__main__][INFO] - Starting iteration 6. [2026-03-25 14:33:12,113][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:33:12,114][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:33:17,343][__main__][INFO] - Number of regex retries in iteration 6: 0 [2026-03-25 14:33:17,345][__main__][INFO] - agents played in iteration 6 are Bob, Alice [2026-03-25 14:33:17,857][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:33:17,924][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:33:17,926][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:33:17,927][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:33:18,582][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:33:19,228][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:33:19,949][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:33:20,664][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:33:21,382][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:33:22,099][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:33:22,816][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:33:23,532][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:33:24,251][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:33:24,968][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:33:25,686][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:33:26,402][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:33:27,119][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:33:27,837][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:33:28,555][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:33:29,273][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:33:29,989][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:33:30,709][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:33:31,427][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:33:32,144][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:33:32,864][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:33:33,581][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:33:34,300][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:33:35,019][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:33:35,737][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:33:36,456][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:33:37,176][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:33:37,894][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:33:38,614][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:33:39,334][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:33:40,053][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:33:40,772][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:33:41,492][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:33:42,210][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:33:42,930][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:33:43,649][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:33:44,368][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:33:45,089][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:33:45,808][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:33:46,528][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:33:47,248][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:33:47,967][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:33:48,687][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:33:49,407][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:33:50,126][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:33:50,847][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:33:51,568][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:33:52,287][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:33:53,321][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:33:54,041][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:33:54,761][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:33:55,480][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:33:56,200][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:33:56,920][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:33:57,639][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:33:58,360][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:33:59,079][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:33:59,798][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:34:00,519][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:34:01,239][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:34:01,960][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:34:02,681][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:34:03,401][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:34:04,123][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:34:04,843][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:34:05,539][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 14:34:06,529][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:34:06,532][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:34:06,533][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:34:07,812][__main__][INFO] - Iteration 7 took 55s (9.39% Gen, 88.31% Train). Generation: 5s, Training: 49s. Estimated remaining time: 15h 18m 48s. Estimated total time: 15h 28m 21s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 50s, 500 more iterations: 7h 44m 10s. [2026-03-25 14:34:07,815][__main__][INFO] - Starting iteration 7. [2026-03-25 14:34:07,819][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:34:07,820][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:34:13,081][__main__][INFO] - Number of regex retries in iteration 7: 0 [2026-03-25 14:34:13,082][__main__][INFO] - agents played in iteration 7 are Bob, Alice [2026-03-25 14:34:13,527][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:34:13,593][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:34:13,594][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:34:13,595][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:34:14,252][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:34:14,950][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:34:15,670][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:34:16,388][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:34:17,105][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:34:17,822][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:34:18,542][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:34:19,258][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:34:19,976][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:34:20,695][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:34:21,412][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:34:22,131][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:34:22,849][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:34:23,568][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:34:24,286][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:34:25,003][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:34:25,724][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:34:26,441][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:34:27,160][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:34:27,880][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:34:28,599][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:34:29,318][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:34:30,037][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:34:30,755][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:34:31,475][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:34:32,195][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:34:32,914][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:34:33,633][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:34:34,353][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:34:35,073][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:34:35,791][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:34:36,512][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:34:37,231][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:34:37,951][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:34:38,674][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:34:39,392][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:34:40,114][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:34:40,834][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:34:41,555][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:34:42,274][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:34:42,993][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:34:43,714][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:34:44,434][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:34:45,153][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:34:45,874][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:34:46,593][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:34:47,314][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:34:48,035][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:34:48,975][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:34:49,696][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:34:50,416][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:34:51,138][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:34:51,857][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:34:52,577][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:34:53,299][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:34:54,018][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:34:54,739][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:34:55,461][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:34:56,179][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:34:56,900][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:34:57,623][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:34:58,342][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:34:59,063][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:34:59,784][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:35:00,505][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:35:01,212][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 14:35:02,215][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:35:02,218][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:35:02,220][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:35:03,694][__main__][INFO] - Iteration 8 took 55s (9.42% Gen, 87.94% Train). Generation: 5s, Training: 49s. Estimated remaining time: 15h 20m 47s. Estimated total time: 15h 31m 17s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 7s, 500 more iterations: 7h 45m 38s. [2026-03-25 14:35:03,697][__main__][INFO] - Starting iteration 8. [2026-03-25 14:35:03,701][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:35:03,702][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:35:09,201][__main__][INFO] - Number of regex retries in iteration 8: 0 [2026-03-25 14:35:09,202][__main__][INFO] - agents played in iteration 8 are Bob, Alice [2026-03-25 14:35:09,658][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:35:09,728][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:35:09,729][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:35:09,730][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:35:10,405][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:35:11,054][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:35:11,774][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:35:12,491][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:35:13,212][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:35:13,928][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:35:14,647][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:35:15,365][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:35:16,085][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:35:16,804][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:35:17,522][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:35:18,241][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:35:18,961][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:35:19,680][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:35:20,399][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:35:21,119][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:35:21,838][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:35:22,559][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:35:23,279][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:35:23,998][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:35:24,718][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:35:25,438][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:35:26,158][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:35:26,878][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:35:27,599][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:35:28,318][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:35:29,039][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:35:29,760][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:35:30,479][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:35:31,198][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:35:31,919][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:35:32,641][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:35:33,360][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:35:34,079][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:35:34,801][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:35:35,522][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:35:36,243][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:35:36,964][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:35:37,685][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:35:38,406][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:35:39,126][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:35:39,846][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:35:40,567][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:35:41,287][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:35:42,007][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:35:42,728][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:35:43,449][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:35:44,168][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:35:45,186][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:35:45,908][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:35:46,627][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:35:47,349][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:35:48,070][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:35:48,791][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:35:49,510][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:35:50,230][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:35:50,953][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:35:51,674][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:35:52,394][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:35:53,116][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:35:53,838][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:35:54,562][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:35:55,284][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:35:56,005][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:35:56,727][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:35:57,445][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 14:35:58,529][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:35:58,532][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:35:58,534][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:35:59,824][__main__][INFO] - Iteration 9 took 56s (9.80% Gen, 87.89% Train). Generation: 5s, Training: 49s. Estimated remaining time: 15h 23m 59s. Estimated total time: 15h 35m 24s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 32s, 500 more iterations: 7h 47m 42s. [2026-03-25 14:35:59,826][__main__][INFO] - Starting iteration 9. [2026-03-25 14:35:59,830][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:35:59,831][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:36:05,111][__main__][INFO] - Number of regex retries in iteration 9: 0 [2026-03-25 14:36:05,112][__main__][INFO] - agents played in iteration 9 are Bob, Alice [2026-03-25 14:36:05,599][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:36:05,666][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:36:05,667][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:36:05,668][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:36:06,334][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:36:06,983][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:36:07,705][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:36:08,424][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:36:09,145][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:36:09,865][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:36:10,584][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:36:11,303][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:36:12,022][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:36:12,741][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:36:13,461][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:36:14,180][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:36:14,900][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:36:15,620][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:36:16,340][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:36:17,058][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:36:17,778][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:36:18,498][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:36:19,216][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:36:19,938][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:36:20,657][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:36:21,376][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:36:22,096][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:36:22,816][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:36:23,538][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:36:24,258][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:36:24,977][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:36:25,697][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:36:26,419][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:36:27,140][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:36:27,862][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:36:28,583][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:36:29,304][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:36:30,025][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:36:30,745][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:36:31,466][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:36:32,186][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:36:32,906][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:36:33,627][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:36:34,349][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:36:35,068][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:36:35,789][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:36:36,510][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:36:37,231][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:36:37,952][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:36:38,674][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:36:39,395][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:36:40,117][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:36:41,088][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:36:41,813][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:36:42,532][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:36:43,253][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:36:43,975][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:36:44,696][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:36:45,420][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:36:46,142][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:36:46,864][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:36:47,586][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:36:48,309][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:36:49,033][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:36:49,755][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:36:50,476][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:36:51,197][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:36:51,919][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:36:52,641][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:36:53,343][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 14:36:54,548][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:36:54,552][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:36:54,567][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:36:55,846][__main__][INFO] - Iteration 10 took 56s (9.43% Gen, 88.29% Train). Generation: 5s, Training: 49s. Estimated remaining time: 15h 21m 15s. Estimated total time: 15h 33m 37s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 21s, 500 more iterations: 7h 46m 48s. [2026-03-25 14:36:55,848][__main__][INFO] - Starting iteration 10. [2026-03-25 14:36:55,852][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:36:55,852][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:37:02,984][__main__][INFO] - Number of regex retries in iteration 10: 0 [2026-03-25 14:37:02,985][__main__][INFO] - agents played in iteration 10 are Bob, Alice [2026-03-25 14:37:03,450][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:37:03,518][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:37:03,519][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:37:03,519][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:37:04,195][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:37:04,842][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:37:05,559][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:37:06,279][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:37:06,996][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:37:07,713][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:37:08,431][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:37:09,149][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:37:09,866][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:37:10,584][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:37:11,303][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:37:12,020][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:37:12,740][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:37:13,457][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:37:14,176][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:37:14,894][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:37:15,613][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:37:16,333][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:37:17,051][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:37:17,771][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:37:18,492][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:37:19,210][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:37:19,929][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:37:20,649][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:37:21,368][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:37:22,088][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:37:22,806][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:37:23,527][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:37:24,246][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:37:24,965][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:37:25,686][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:37:26,406][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:37:27,128][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:37:27,848][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:37:28,569][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:37:29,291][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:37:30,014][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:37:30,734][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:37:31,454][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:37:32,175][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:37:32,897][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:37:33,619][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:37:34,339][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:37:35,060][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:37:35,783][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:37:36,504][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:37:37,226][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:37:37,948][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:37:38,896][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:37:39,619][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:37:40,340][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:37:41,062][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:37:41,783][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:37:42,505][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:37:43,225][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:37:43,946][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:37:44,666][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:37:45,387][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:37:46,109][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:37:46,830][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:37:47,551][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:37:48,272][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:37:48,992][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:37:49,713][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:37:50,432][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:37:51,141][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 14:37:52,579][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:37:52,583][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:37:52,585][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:37:53,903][__main__][INFO] - Iteration 11 took 58s (12.29% Gen, 85.44% Train). Generation: 7s, Training: 49s. Estimated remaining time: 15h 54m 13s. Estimated total time: 16h 7m 32s. Time estimates for 10 more iterations: 9m 40s, 100 more iterations: 1h 36m 45s, 500 more iterations: 8h 3m 46s. [2026-03-25 14:37:53,906][__main__][INFO] - Starting iteration 11. [2026-03-25 14:37:53,910][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:37:53,911][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:37:59,264][__main__][INFO] - Number of regex retries in iteration 11: 0 [2026-03-25 14:37:59,265][__main__][INFO] - agents played in iteration 11 are Bob, Alice [2026-03-25 14:37:59,736][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:37:59,805][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:37:59,807][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:37:59,807][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:38:00,497][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:38:01,147][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:38:01,868][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:38:02,587][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:38:03,307][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:38:04,027][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:38:04,745][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:38:05,464][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:38:06,185][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:38:06,903][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:38:07,622][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:38:08,341][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:38:09,060][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:38:09,778][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:38:10,497][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:38:11,215][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:38:11,934][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:38:12,653][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:38:13,372][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:38:14,093][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:38:14,810][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:38:15,529][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:38:16,248][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:38:16,968][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:38:17,688][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:38:18,408][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:38:19,128][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:38:19,846][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:38:20,567][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:38:21,287][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:38:22,005][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:38:22,726][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:38:23,445][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:38:24,165][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:38:24,886][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:38:25,605][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:38:26,328][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:38:27,050][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:38:27,771][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:38:28,490][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:38:29,213][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:38:29,936][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:38:30,655][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:38:31,377][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:38:32,101][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:38:32,822][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:38:33,543][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:38:34,263][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:38:35,286][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:38:36,007][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:38:36,726][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:38:37,448][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:38:38,168][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:38:38,888][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:38:39,610][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:38:40,333][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:38:41,054][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:38:41,777][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:38:42,499][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:38:43,221][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:38:43,944][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:38:44,665][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:38:45,386][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:38:46,110][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:38:46,832][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:38:47,584][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 14:38:48,640][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:38:48,644][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:38:48,646][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:38:49,933][__main__][INFO] - Iteration 12 took 56s (9.56% Gen, 88.14% Train). Generation: 5s, Training: 49s. Estimated remaining time: 15h 19m 29s. Estimated total time: 15h 33m 45s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 22s, 500 more iterations: 7h 46m 52s. [2026-03-25 14:38:49,936][__main__][INFO] - Starting iteration 12. [2026-03-25 14:38:49,941][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:38:49,942][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:38:55,338][__main__][INFO] - Number of regex retries in iteration 12: 0 [2026-03-25 14:38:55,339][__main__][INFO] - agents played in iteration 12 are Bob, Alice [2026-03-25 14:38:55,795][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:38:55,863][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:38:55,864][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:38:55,864][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:38:56,538][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:38:57,187][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:38:57,908][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:38:58,626][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:38:59,344][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:39:00,063][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:39:00,781][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:39:01,500][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:39:02,219][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:39:02,937][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:39:03,655][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:39:04,375][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:39:05,093][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:39:05,812][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:39:06,532][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:39:07,252][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:39:07,971][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:39:08,691][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:39:09,410][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:39:10,130][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:39:10,850][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:39:11,567][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:39:12,287][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:39:13,006][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:39:13,724][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:39:14,444][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:39:15,164][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:39:15,884][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:39:16,607][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:39:17,326][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:39:18,046][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:39:18,765][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:39:19,485][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:39:20,205][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:39:20,925][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:39:21,645][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:39:22,367][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:39:23,089][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:39:23,809][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:39:24,530][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:39:25,249][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:39:25,969][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:39:26,691][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:39:27,409][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:39:28,131][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:39:28,851][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:39:29,571][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:39:30,291][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:39:31,236][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:39:31,957][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:39:32,676][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:39:33,396][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:39:34,117][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:39:34,837][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:39:35,559][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:39:36,279][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:39:37,000][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:39:37,721][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:39:38,441][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:39:39,165][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:39:39,888][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:39:40,609][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:39:41,332][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:39:42,055][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:39:42,777][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:39:43,504][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 14:39:44,518][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:39:44,522][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:39:44,523][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:39:45,825][__main__][INFO] - Iteration 13 took 55s (9.66% Gen, 88.01% Train). Generation: 5s, Training: 49s. Estimated remaining time: 15h 16m 15s. Estimated total time: 15h 31m 27s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 8s, 500 more iterations: 7h 45m 43s. [2026-03-25 14:39:45,828][__main__][INFO] - Starting iteration 13. [2026-03-25 14:39:45,833][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:39:45,834][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:39:51,236][__main__][INFO] - Number of regex retries in iteration 13: 0 [2026-03-25 14:39:51,238][__main__][INFO] - agents played in iteration 13 are Bob, Alice [2026-03-25 14:39:51,692][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:39:51,761][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:39:51,762][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:39:51,763][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:39:52,457][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:39:53,107][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:39:53,828][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:39:54,547][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:39:55,267][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:39:55,988][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:39:56,706][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:39:57,426][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:39:58,147][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:39:58,867][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:39:59,587][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:40:00,308][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:40:01,028][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:40:01,747][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:40:02,468][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:40:03,189][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:40:03,907][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:40:04,626][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:40:05,346][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:40:06,063][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:40:06,783][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:40:07,504][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:40:08,222][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:40:08,943][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:40:09,663][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:40:10,382][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:40:11,102][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:40:11,823][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:40:12,542][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:40:13,265][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:40:13,983][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:40:14,704][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:40:15,423][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:40:16,143][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:40:16,865][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:40:17,584][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:40:18,304][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:40:19,026][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:40:19,745][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:40:20,465][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:40:21,187][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:40:21,907][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:40:22,627][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:40:23,347][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:40:24,068][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:40:24,788][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:40:25,508][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:40:26,230][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:40:27,168][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:40:27,890][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:40:28,610][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:40:29,331][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:40:30,052][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:40:30,773][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:40:31,494][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:40:32,214][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:40:32,936][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:40:33,658][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:40:34,380][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:40:35,100][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:40:35,822][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:40:36,543][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:40:37,264][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:40:37,985][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:40:38,708][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:40:39,428][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 14:40:40,650][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:40:40,653][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:40:40,654][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:40:41,999][__main__][INFO] - Iteration 14 took 56s (9.62% Gen, 87.98% Train). Generation: 5s, Training: 49s. Estimated remaining time: 15h 20m 1s. Estimated total time: 15h 36m 8s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 36s, 500 more iterations: 7h 48m 4s. [2026-03-25 14:40:42,002][__main__][INFO] - Starting iteration 14. [2026-03-25 14:40:42,007][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:40:42,008][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:40:47,498][__main__][INFO] - Number of regex retries in iteration 14: 0 [2026-03-25 14:40:47,500][__main__][INFO] - agents played in iteration 14 are Bob, Alice [2026-03-25 14:40:47,956][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:40:48,024][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:40:48,025][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:40:48,026][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:40:48,731][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:40:49,379][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:40:50,100][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:40:50,818][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:40:51,536][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:40:52,254][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:40:52,973][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:40:53,690][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:40:54,408][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:40:55,127][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:40:55,846][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:40:56,564][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:40:57,282][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:40:58,000][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:40:58,719][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:40:59,439][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:41:00,157][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:41:00,876][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:41:01,597][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:41:02,316][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:41:03,036][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:41:03,755][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:41:04,475][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:41:05,194][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:41:05,914][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:41:06,633][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:41:07,352][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:41:08,073][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:41:08,793][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:41:09,513][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:41:10,233][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:41:10,955][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:41:11,674][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:41:12,394][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:41:13,114][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:41:13,833][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:41:14,552][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:41:15,275][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:41:15,994][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:41:16,714][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:41:17,435][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:41:18,156][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:41:18,875][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:41:19,596][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:41:20,317][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:41:21,038][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:41:21,758][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:41:22,478][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:41:23,498][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:41:24,219][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:41:24,939][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:41:25,661][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:41:26,381][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:41:27,101][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:41:27,823][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:41:28,544][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:41:29,264][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:41:29,985][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:41:30,706][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:41:31,427][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:41:32,146][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:41:32,868][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:41:33,589][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:41:34,310][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:41:35,031][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:41:35,751][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 14:41:36,683][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:41:36,685][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:41:36,686][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:41:38,142][__main__][INFO] - Iteration 15 took 56s (9.78% Gen, 87.62% Train). Generation: 5s, Training: 49s. Estimated remaining time: 15h 18m 34s. Estimated total time: 15h 35m 37s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 33s, 500 more iterations: 7h 47m 48s. [2026-03-25 14:41:38,146][__main__][INFO] - Starting iteration 15. [2026-03-25 14:41:38,152][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:41:38,153][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:41:43,434][__main__][INFO] - Number of regex retries in iteration 15: 0 [2026-03-25 14:41:43,435][__main__][INFO] - agents played in iteration 15 are Bob, Alice [2026-03-25 14:41:43,888][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:41:43,955][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:41:43,956][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:41:43,956][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:41:44,633][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:41:45,280][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:41:46,000][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:41:46,717][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:41:47,437][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:41:48,156][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:41:48,875][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:41:49,593][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:41:50,312][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:41:51,029][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:41:51,749][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:41:52,468][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:41:53,186][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:41:53,905][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:41:54,624][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:41:55,344][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:41:56,062][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:41:56,782][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:41:57,501][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:41:58,220][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:41:58,941][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:41:59,659][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:42:00,380][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:42:01,099][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:42:01,818][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:42:02,538][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:42:03,256][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:42:03,975][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:42:04,696][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:42:05,414][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:42:06,134][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:42:06,855][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:42:07,573][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:42:08,292][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:42:09,015][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:42:09,734][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:42:10,454][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:42:11,176][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:42:11,896][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:42:12,617][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:42:13,337][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:42:14,058][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:42:14,779][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:42:15,497][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:42:16,219][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:42:16,940][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:42:17,659][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:42:18,379][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:42:19,316][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:42:20,037][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:42:20,757][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:42:21,479][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:42:22,198][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:42:22,919][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:42:23,640][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:42:24,361][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:42:25,080][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:42:25,802][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:42:26,523][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:42:27,245][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:42:27,965][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:42:28,686][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:42:29,407][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:42:30,128][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:42:30,849][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:42:31,560][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46
[2026-03-25 14:42:32,614][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt
[2026-03-25 14:42:32,618][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt
[2026-03-25 14:42:32,623][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl
[2026-03-25 14:42:33,975][__main__][INFO] - Iteration 16 took 55s (9.46% Gen, 88.11% Train). Generation: 5s, Training: 49s. Estimated remaining time: 15h 12m 26s. Estimated total time: 15h 30m 26s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 2s, 500 more iterations: 7h 45m 13s.
[2026-03-25 14:42:33,979][__main__][INFO] - Starting iteration 16.
[2026-03-25 14:42:33,986][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1.
[2026-03-25 14:42:33,987][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:42:39,252][__main__][INFO] - Number of regex retries in iteration 16: 0 [2026-03-25 14:42:39,253][__main__][INFO] - agents played in iteration 16 are Bob, Alice [2026-03-25 14:42:39,710][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:42:39,778][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:42:39,778][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:42:39,779][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:42:40,461][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:42:41,110][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:42:41,829][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:42:42,546][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:42:43,264][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:42:43,981][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:42:44,700][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:42:45,419][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:42:46,136][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:42:46,856][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:42:47,575][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:42:48,293][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:42:49,011][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:42:49,728][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:42:50,446][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:42:51,165][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:42:51,884][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:42:52,602][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:42:53,322][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:42:54,040][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:42:54,760][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:42:55,479][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:42:56,198][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:42:56,919][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:42:57,637][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:42:58,357][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:42:59,076][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:42:59,795][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:43:00,514][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:43:01,234][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:43:01,954][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:43:02,675][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:43:03,395][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:43:04,115][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:43:04,835][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:43:05,554][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:43:06,275][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:43:06,994][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:43:07,716][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:43:08,437][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:43:09,157][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:43:09,879][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:43:10,599][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:43:11,319][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:43:12,040][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:43:12,761][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:43:13,483][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:43:14,203][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:43:15,149][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:43:15,870][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:43:16,590][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:43:17,313][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:43:18,032][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:43:18,753][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:43:19,474][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:43:20,195][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:43:20,914][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:43:21,635][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:43:22,356][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:43:23,078][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:43:23,798][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:43:24,519][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:43:25,240][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:43:25,961][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:43:26,681][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:43:27,412][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46
[2026-03-25 14:43:28,538][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt
[2026-03-25 14:43:28,541][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt
[2026-03-25 14:43:28,544][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl
[2026-03-25 14:43:29,878][__main__][INFO] - Iteration 17 took 55s (9.42% Gen, 88.18% Train). Generation: 5s, Training: 49s. Estimated remaining time: 15h 12m 40s. Estimated total time: 15h 31m 35s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 9s, 500 more iterations: 7h 45m 47s.
[2026-03-25 14:43:29,881][__main__][INFO] - Starting iteration 17.
[2026-03-25 14:43:29,884][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1.
[2026-03-25 14:43:29,885][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:43:35,191][__main__][INFO] - Number of regex retries in iteration 17: 0 [2026-03-25 14:43:35,192][__main__][INFO] - agents played in iteration 17 are Bob, Alice [2026-03-25 14:43:35,648][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:43:35,715][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:43:35,716][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:43:35,716][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:43:36,391][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:43:37,038][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:43:37,758][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:43:38,476][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:43:39,196][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:43:39,914][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:43:40,632][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:43:41,350][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:43:42,069][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:43:42,788][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:43:43,506][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:43:44,225][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:43:44,945][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:43:45,664][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:43:46,384][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:43:47,103][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:43:47,822][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:43:48,543][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:43:49,262][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:43:49,980][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:43:50,700][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:43:51,420][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:43:52,138][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:43:52,860][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:43:53,578][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:43:54,299][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:43:55,020][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:43:55,738][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:43:56,459][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:43:57,178][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:43:57,897][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:43:58,618][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:43:59,337][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:44:00,056][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:44:00,777][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:44:01,496][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:44:02,216][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:44:02,938][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:44:03,657][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:44:04,378][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:44:05,098][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:44:05,818][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:44:06,539][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:44:07,258][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:44:07,980][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:44:08,699][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:44:09,420][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:44:10,142][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:44:11,125][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:44:11,846][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:44:12,566][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:44:13,288][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:44:14,007][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:44:14,728][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:44:15,449][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:44:16,168][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:44:16,889][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:44:17,611][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:44:18,331][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:44:19,052][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:44:19,773][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:44:20,494][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:44:21,214][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:44:21,935][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:44:22,657][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:44:23,386][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47
[2026-03-25 14:44:24,584][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt
[2026-03-25 14:44:24,588][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt
[2026-03-25 14:44:24,590][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl
[2026-03-25 14:44:25,826][__main__][INFO] - Iteration 18 took 55s (9.49% Gen, 88.30% Train). Generation: 5s, Training: 49s. Estimated remaining time: 15h 12m 31s. Estimated total time: 15h 32m 23s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 14s, 500 more iterations: 7h 46m 11s.
[2026-03-25 14:44:25,828][__main__][INFO] - Starting iteration 18.
[2026-03-25 14:44:25,832][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1.
[2026-03-25 14:44:25,833][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:44:31,342][__main__][INFO] - Number of regex retries in iteration 18: 0 [2026-03-25 14:44:31,343][__main__][INFO] - agents played in iteration 18 are Bob, Alice [2026-03-25 14:44:31,822][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:44:31,889][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:44:31,890][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:44:31,891][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:44:32,569][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:44:33,217][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:44:33,937][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:44:34,653][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:44:35,371][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:44:36,090][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:44:36,807][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:44:37,527][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:44:38,244][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:44:38,964][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:44:39,682][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:44:40,399][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:44:41,118][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:44:41,836][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:44:42,555][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:44:43,273][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:44:43,991][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:44:44,710][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:44:45,429][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:44:46,147][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:44:46,866][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:44:47,585][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:44:48,303][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:44:49,023][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:44:49,741][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:44:50,459][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:44:51,181][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:44:51,899][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:44:52,620][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:44:53,339][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:44:54,058][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:44:54,778][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:44:55,498][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:44:56,217][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:44:56,936][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:44:57,656][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:44:58,376][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:44:59,096][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:44:59,816][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:45:00,536][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:45:01,255][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:45:01,977][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:45:02,696][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:45:03,416][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:45:04,138][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:45:04,857][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:45:05,577][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:45:06,298][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:45:07,240][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:45:07,962][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:45:08,683][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:45:09,405][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:45:10,125][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:45:10,847][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:45:11,566][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:45:12,287][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:45:13,008][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:45:13,727][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:45:14,449][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:45:15,169][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:45:15,888][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:45:16,610][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:45:17,329][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:45:18,050][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:45:18,773][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:45:19,493][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 14:45:20,669][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:45:20,674][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:45:20,676][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:45:22,012][__main__][INFO] - Iteration 19 took 56s (9.81% Gen, 87.81% Train). Generation: 5s, Training: 49s. Estimated remaining time: 15h 15m 34s. Estimated total time: 15h 36m 21s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 38s, 500 more iterations: 7h 48m 10s. [2026-03-25 14:45:22,016][__main__][INFO] - Starting iteration 19. [2026-03-25 14:45:22,021][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:45:22,022][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:45:27,382][__main__][INFO] - Number of regex retries in iteration 19: 0 [2026-03-25 14:45:27,383][__main__][INFO] - agents played in iteration 19 are Bob, Alice [2026-03-25 14:45:27,905][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:45:27,971][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:45:27,972][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:45:27,973][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:45:28,650][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:45:29,297][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:45:30,017][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:45:30,734][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:45:31,452][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:45:32,171][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:45:32,889][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:45:33,606][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:45:34,325][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:45:35,044][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:45:35,762][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:45:36,480][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:45:37,198][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:45:37,918][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:45:38,636][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:45:39,355][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:45:40,076][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:45:40,793][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:45:41,512][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:45:42,233][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:45:42,951][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:45:43,670][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:45:44,388][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:45:45,107][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:45:45,828][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:45:46,546][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:45:47,266][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:45:47,986][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:45:48,704][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:45:49,424][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:45:50,144][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:45:50,863][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:45:51,583][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:45:52,303][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:45:53,022][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:45:53,741][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:45:54,461][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:45:55,180][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:45:55,900][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:45:56,620][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:45:57,339][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:45:58,059][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:45:58,781][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:45:59,501][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:46:00,220][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:46:00,941][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:46:01,660][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:46:02,378][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:46:03,389][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:46:04,110][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:46:04,830][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:46:05,549][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:46:06,269][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:46:06,990][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:46:07,709][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:46:14,126][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:46:16,746][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:46:17,464][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:46:18,182][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:46:18,900][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:46:19,617][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:46:20,334][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:46:21,051][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:46:21,768][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:46:22,487][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:46:23,235][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:54 [2026-03-25 14:46:24,620][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:46:24,625][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:46:24,627][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:46:25,960][__main__][INFO] - Iteration 20 took 1m 3s (8.38% Gen, 89.53% Train). Generation: 5s, Training: 57s. Estimated remaining time: 17h 23m 50s. Estimated total time: 17h 45m 42s. Time estimates for 10 more iterations: 10m 39s, 100 more iterations: 1h 46m 34s, 500 more iterations: 8h 52m 51s. [2026-03-25 14:46:25,963][__main__][INFO] - Starting iteration 20. [2026-03-25 14:46:25,968][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:46:25,969][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:46:31,353][__main__][INFO] - Number of regex retries in iteration 20: 0 [2026-03-25 14:46:31,354][__main__][INFO] - agents played in iteration 20 are Bob, Alice [2026-03-25 14:46:31,868][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:46:31,937][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:46:31,938][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:46:31,939][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:46:32,626][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:46:33,273][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:46:33,990][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:46:34,705][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:46:35,421][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:46:36,139][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:46:36,854][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:46:37,572][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:46:38,288][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:46:39,006][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:46:39,723][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:46:40,441][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:46:41,157][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:46:41,874][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:46:42,591][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:46:43,308][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:46:44,028][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:46:44,743][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:46:45,462][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:46:46,177][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:46:46,896][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:46:47,611][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:46:48,328][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:46:49,045][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:46:49,761][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:46:50,478][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:46:51,195][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:46:51,913][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:46:52,630][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:46:53,349][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:46:54,066][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:46:54,784][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:46:55,500][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:46:56,218][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:46:56,938][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:46:57,654][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:46:58,374][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:46:59,092][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:46:59,810][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:47:00,530][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:47:01,246][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:47:01,966][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:47:02,685][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:47:03,404][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:47:04,122][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:47:04,840][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:47:05,558][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:47:06,278][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:47:07,221][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:47:07,940][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:47:08,660][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:47:09,380][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:47:10,099][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:47:10,819][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:47:11,537][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:47:12,255][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:47:12,975][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:47:13,694][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:47:14,414][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:47:15,133][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:47:15,851][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:47:16,572][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:47:17,291][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:47:18,011][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:47:18,731][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:47:19,452][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 14:47:20,671][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:47:20,675][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:47:20,678][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:47:21,914][__main__][INFO] - Iteration 21 took 55s (9.62% Gen, 88.16% Train). Generation: 5s, Training: 49s. Estimated remaining time: 15h 9m 41s. Estimated total time: 15h 32m 28s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 14s, 500 more iterations: 7h 46m 14s. [2026-03-25 14:47:21,917][__main__][INFO] - Starting iteration 21. [2026-03-25 14:47:21,922][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:47:21,923][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:47:27,156][__main__][INFO] - Number of regex retries in iteration 21: 0 [2026-03-25 14:47:27,157][__main__][INFO] - agents played in iteration 21 are Bob, Alice [2026-03-25 14:47:27,623][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:47:27,690][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:47:27,691][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:47:27,692][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:47:28,369][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:47:29,016][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:47:29,735][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:47:30,454][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:47:31,171][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:47:31,888][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:47:32,607][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:47:33,324][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:47:34,041][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:47:34,760][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:47:35,478][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:47:36,195][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:47:36,912][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:47:37,631][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:47:38,350][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:47:39,067][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:47:39,786][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:47:40,504][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:47:41,222][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:47:41,941][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:47:42,658][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:47:43,377][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:47:44,096][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:47:44,814][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:47:45,533][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:47:46,251][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:47:46,968][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:47:47,688][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:47:48,405][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:47:49,125][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:47:49,844][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:47:50,563][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:47:51,283][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:47:52,001][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:47:52,720][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:47:53,442][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:47:54,160][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:47:54,880][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:47:55,599][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:47:56,319][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:47:57,037][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:47:57,757][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:47:58,478][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:47:59,197][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:47:59,917][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:48:00,638][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:48:01,356][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:48:02,075][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:48:03,015][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:48:03,734][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:48:04,454][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:48:05,173][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:48:05,892][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:48:06,612][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:48:07,332][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:48:08,050][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:48:08,772][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:48:09,494][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:48:10,213][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:48:10,934][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:48:11,654][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:48:12,374][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:48:13,093][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:48:13,814][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:48:14,536][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:48:15,252][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 14:48:16,643][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:48:16,648][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:48:16,651][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:48:19,499][__main__][INFO] - Iteration 22 took 57s (9.09% Gen, 85.96% Train). Generation: 5s, Training: 49s. Estimated remaining time: 15h 35m 54s. Estimated total time: 15h 59m 40s. Time estimates for 10 more iterations: 9m 35s, 100 more iterations: 1h 35m 58s, 500 more iterations: 7h 59m 50s. [2026-03-25 14:48:19,502][__main__][INFO] - Starting iteration 22. [2026-03-25 14:48:19,507][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:48:19,508][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:48:25,602][__main__][INFO] - Number of regex retries in iteration 22: 0 [2026-03-25 14:48:25,604][__main__][INFO] - agents played in iteration 22 are Bob, Alice [2026-03-25 14:48:26,066][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:48:26,133][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:48:26,134][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:48:26,135][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:48:26,811][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:48:27,577][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:48:28,295][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:48:29,013][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:48:29,734][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:48:30,451][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:48:31,168][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:48:31,885][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:48:32,602][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:48:33,318][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:48:34,034][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:48:34,753][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:48:35,469][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:48:36,187][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:48:36,904][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:48:37,622][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:48:38,339][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:48:39,060][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:48:39,778][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:48:40,495][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:48:41,212][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:48:41,929][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:48:42,648][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:48:43,365][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:48:44,084][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:48:44,803][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:48:45,524][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:48:46,243][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:48:46,962][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:48:47,681][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:48:48,401][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:48:49,120][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:48:49,840][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:48:50,560][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:48:51,278][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:48:51,998][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:48:52,719][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:48:53,438][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:48:54,156][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:48:54,878][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:48:55,596][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:48:56,315][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:48:57,037][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:48:57,755][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:48:58,475][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:48:59,195][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:48:59,913][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:49:00,632][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:49:01,674][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:49:02,394][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:49:03,113][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:49:03,833][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:49:04,551][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:49:05,272][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:49:05,993][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:49:06,711][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:49:07,431][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:49:08,153][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:49:08,876][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:49:09,596][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:49:10,316][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:49:11,035][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:49:11,756][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:49:12,474][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:49:13,195][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:49:13,920][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 14:49:15,339][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:49:15,344][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:49:15,446][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:49:17,100][__main__][INFO] - Iteration 23 took 57s (10.58% Gen, 86.54% Train). Generation: 6s, Training: 49s. Estimated remaining time: 15h 35m 13s. Estimated total time: 15h 59m 55s. Time estimates for 10 more iterations: 9m 35s, 100 more iterations: 1h 35m 59s, 500 more iterations: 7h 59m 57s. [2026-03-25 14:49:17,105][__main__][INFO] - Starting iteration 23. [2026-03-25 14:49:17,113][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:49:17,114][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:49:22,455][__main__][INFO] - Number of regex retries in iteration 23: 0 [2026-03-25 14:49:22,456][__main__][INFO] - agents played in iteration 23 are Bob, Alice [2026-03-25 14:49:22,916][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:49:22,983][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:49:22,984][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:49:22,985][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:49:23,665][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:49:24,313][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:49:25,034][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:49:25,751][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:49:26,469][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:49:27,186][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:49:27,903][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:49:28,622][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:49:29,339][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:49:30,056][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:49:30,774][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:49:31,491][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:49:32,209][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:49:32,926][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:49:33,643][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:49:34,361][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:49:35,078][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:49:35,796][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:49:36,513][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:49:37,232][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:49:37,949][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:49:38,668][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:49:39,387][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:49:40,108][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:49:40,826][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:49:41,545][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:49:42,263][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:49:42,981][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:49:43,700][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:49:44,419][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:49:45,138][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:49:45,855][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:49:46,576][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:49:47,294][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:49:48,014][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:49:48,732][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:49:49,453][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:49:50,172][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:49:50,891][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:49:51,611][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:49:52,331][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:49:53,048][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:49:53,768][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:49:54,487][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:49:55,207][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:49:55,926][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:49:56,646][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:49:57,364][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:49:58,305][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:49:59,027][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:49:59,745][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:50:00,465][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:50:01,185][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:50:01,903][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:50:02,623][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:50:03,345][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:50:04,064][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:50:04,783][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:50:05,503][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:50:06,223][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:50:06,942][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:50:07,663][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:50:08,383][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:50:09,105][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:50:09,826][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:50:10,539][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 14:50:12,218][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:50:12,223][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:50:12,226][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:50:13,521][__main__][INFO] - Iteration 24 took 56s (9.47% Gen, 88.23% Train). Generation: 5s, Training: 49s. Estimated remaining time: 15h 14m 33s. Estimated total time: 15h 40m 12s. Time estimates for 10 more iterations: 9m 24s, 100 more iterations: 1h 34m 1s, 500 more iterations: 7h 50m 6s. [2026-03-25 14:50:13,524][__main__][INFO] - Starting iteration 24. [2026-03-25 14:50:13,529][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:50:13,530][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:50:18,768][__main__][INFO] - Number of regex retries in iteration 24: 0 [2026-03-25 14:50:18,769][__main__][INFO] - agents played in iteration 24 are Bob, Alice [2026-03-25 14:50:19,228][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:50:19,297][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:50:19,298][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:50:19,299][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:50:19,977][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:50:20,625][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:50:21,344][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:50:22,059][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:50:22,778][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:50:23,495][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:50:24,211][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:50:24,929][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:50:25,645][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:50:26,363][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:50:27,081][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:50:27,800][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:50:28,517][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:50:29,235][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:50:29,953][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:50:30,670][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:50:31,389][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:50:32,107][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:50:32,825][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:50:33,544][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:50:34,263][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:50:34,981][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:50:35,699][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:50:36,418][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:50:37,136][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:50:37,854][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:50:38,573][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:50:39,292][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:50:40,013][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:50:40,730][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:50:41,451][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:50:42,171][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:50:42,888][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:50:43,608][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:50:44,326][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:50:45,045][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:50:45,765][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:50:46,484][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:50:47,203][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:50:47,923][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:50:48,642][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:50:49,361][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:50:50,080][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:50:50,801][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:50:51,520][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:50:52,240][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:50:52,960][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:50:53,679][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:50:54,626][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:50:55,347][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:50:56,066][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:50:56,787][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:50:57,508][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:50:58,227][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:50:58,947][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:50:59,668][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:51:00,386][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:51:01,107][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:51:01,828][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:51:02,548][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:51:03,267][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:51:03,988][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:51:04,708][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:51:05,428][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:51:06,149][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:51:06,868][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 14:51:08,158][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:51:08,163][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:51:08,165][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:51:09,583][__main__][INFO] - Iteration 25 took 56s (9.35% Gen, 88.12% Train). Generation: 5s, Training: 49s. Estimated remaining time: 15h 7m 42s. Estimated total time: 15h 34m 17s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 25s, 500 more iterations: 7h 47m 8s. [2026-03-25 14:51:09,586][__main__][INFO] - Starting iteration 25. [2026-03-25 14:51:09,590][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:51:09,591][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:51:14,862][__main__][INFO] - Number of regex retries in iteration 25: 0 [2026-03-25 14:51:14,863][__main__][INFO] - agents played in iteration 25 are Bob, Alice [2026-03-25 14:51:15,325][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:51:15,393][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:51:15,394][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:51:15,395][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:51:16,072][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:51:16,717][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:51:17,438][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:51:18,156][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:51:18,872][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:51:19,589][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:51:20,307][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:51:21,025][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:51:21,743][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:51:22,460][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:51:23,179][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:51:23,899][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:51:24,616][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:51:25,336][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:51:26,054][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:51:26,773][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:51:27,491][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:51:28,211][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:51:28,930][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:51:29,649][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:51:30,367][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:51:31,086][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:51:31,805][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:51:32,523][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:51:33,243][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:51:33,961][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:51:34,680][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:51:35,399][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:51:36,117][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:51:36,837][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:51:37,556][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:51:38,275][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:51:38,995][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:51:39,714][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:51:40,434][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:51:41,154][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:51:41,873][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:51:42,594][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:51:43,315][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:51:44,033][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:51:44,754][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:51:45,474][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:51:46,193][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:51:46,914][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:51:47,633][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:51:48,354][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:51:49,075][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:51:49,793][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:51:50,830][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:51:51,550][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:51:52,270][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:51:52,990][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:51:53,709][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:51:54,430][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:51:55,150][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:51:55,869][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:51:56,591][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:51:57,310][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:51:58,031][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:51:58,750][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:51:59,470][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:52:00,192][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:52:00,911][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:52:01,631][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:52:02,353][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:52:03,078][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 14:52:04,345][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:52:04,349][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:52:04,352][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:52:05,591][__main__][INFO] - Iteration 26 took 56s (9.41% Gen, 88.37% Train). Generation: 5s, Training: 49s. Estimated remaining time: 15h 5m 52s. Estimated total time: 15h 33m 23s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 20s, 500 more iterations: 7h 46m 41s. [2026-03-25 14:52:05,594][__main__][INFO] - Starting iteration 26. [2026-03-25 14:52:05,598][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:52:05,599][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:52:10,916][__main__][INFO] - Number of regex retries in iteration 26: 0 [2026-03-25 14:52:10,918][__main__][INFO] - agents played in iteration 26 are Bob, Alice [2026-03-25 14:52:11,376][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:52:11,444][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:52:11,445][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:52:11,445][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:52:12,126][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:52:12,774][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:52:13,493][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:52:14,211][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:52:14,927][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:52:15,646][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:52:16,362][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:52:17,082][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:52:17,800][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:52:18,519][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:52:19,237][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:52:19,956][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:52:20,675][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:52:21,392][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:52:22,112][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:52:22,831][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:52:23,550][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:52:24,268][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:52:24,987][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:52:25,705][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:52:26,425][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:52:27,145][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:52:27,863][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:52:28,583][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:52:29,302][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:52:30,022][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:52:30,741][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:52:31,460][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:52:32,179][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:52:32,899][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:52:33,617][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:52:34,337][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:52:35,057][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:52:35,774][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:52:36,495][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:52:37,214][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:52:37,932][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:52:38,653][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:52:39,374][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:52:40,094][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:52:40,813][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:52:41,533][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:52:42,253][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:52:42,973][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:52:43,693][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:52:44,413][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:52:45,132][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:52:45,853][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:52:46,794][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:52:47,515][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:52:48,235][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:52:48,955][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:52:49,677][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:52:50,397][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:52:51,118][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:52:51,838][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:52:52,560][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:52:53,281][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:52:54,002][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:52:54,722][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:52:55,443][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:52:56,164][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:52:56,884][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:52:57,606][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:52:58,328][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:52:59,043][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 14:53:00,227][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:53:00,231][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:53:00,233][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:53:01,538][__main__][INFO] - Iteration 27 took 55s (9.51% Gen, 88.15% Train). Generation: 5s, Training: 49s. Estimated remaining time: 15h 3m 55s. Estimated total time: 15h 32m 22s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 14s, 500 more iterations: 7h 46m 11s. [2026-03-25 14:53:01,541][__main__][INFO] - Starting iteration 27. [2026-03-25 14:53:01,545][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:53:01,546][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:53:06,749][__main__][INFO] - Number of regex retries in iteration 27: 0 [2026-03-25 14:53:06,750][__main__][INFO] - agents played in iteration 27 are Bob, Alice [2026-03-25 14:53:07,238][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:53:07,306][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:53:07,306][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:53:07,307][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:53:07,991][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:53:08,641][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:53:09,362][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:53:10,080][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:53:10,798][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:53:11,516][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:53:12,235][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:53:12,953][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:53:13,672][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:53:14,390][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:53:15,109][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:53:15,829][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:53:16,547][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:53:17,266][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:53:17,985][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:53:18,701][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:53:19,422][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:53:20,140][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:53:20,859][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:53:21,579][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:53:22,298][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:53:23,018][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:53:23,738][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:53:24,457][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:53:25,178][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:53:25,898][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:53:26,616][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:53:27,337][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:53:28,057][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:53:28,776][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:53:29,496][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:53:30,216][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:53:30,936][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:53:31,656][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:53:32,376][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:53:33,096][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:53:33,815][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:53:34,536][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:53:35,256][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:53:35,975][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:53:36,700][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:53:37,422][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:53:38,144][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:53:38,867][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:53:39,590][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:53:40,311][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:53:41,033][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:53:41,756][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:53:42,720][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:53:43,440][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:53:44,160][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:53:44,881][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:53:45,603][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:53:46,323][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:53:47,043][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:53:47,764][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:53:48,485][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:53:49,203][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:53:49,925][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:53:50,646][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:53:51,367][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:53:52,088][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:53:52,808][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:53:53,531][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:53:54,250][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:53:54,990][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 14:53:56,163][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:53:56,167][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:53:56,170][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:53:57,474][__main__][INFO] - Iteration 28 took 55s (9.31% Gen, 88.36% Train). Generation: 5s, Training: 49s. Estimated remaining time: 15h 2m 47s. Estimated total time: 15h 32m 10s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 13s, 500 more iterations: 7h 46m 5s. [2026-03-25 14:53:57,478][__main__][INFO] - Starting iteration 28. [2026-03-25 14:53:57,482][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:53:57,483][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:54:02,800][__main__][INFO] - Number of regex retries in iteration 28: 0 [2026-03-25 14:54:02,801][__main__][INFO] - agents played in iteration 28 are Bob, Alice [2026-03-25 14:54:03,332][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:54:03,401][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:54:03,402][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:54:03,403][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:54:04,085][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:54:04,732][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:54:05,451][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:54:06,169][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:54:06,887][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:54:07,606][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:54:08,325][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:54:09,044][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:54:09,762][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:54:10,481][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:54:11,200][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:54:11,918][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:54:12,638][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:54:13,356][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:54:14,076][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:54:14,795][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:54:15,513][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:54:16,233][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:54:16,953][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:54:17,673][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:54:18,392][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:54:19,111][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:54:19,830][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:54:20,548][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:54:21,268][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:54:21,987][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:54:22,705][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:54:23,426][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:54:24,145][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:54:24,864][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:54:25,584][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:54:26,304][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:54:27,023][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:54:27,745][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:54:28,465][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:54:29,184][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:54:29,904][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:54:30,625][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:54:31,345][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:54:32,065][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:54:32,786][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:54:33,505][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:54:34,226][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:54:34,945][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:54:35,666][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:54:36,385][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:54:37,105][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:54:37,826][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:54:38,804][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:54:39,526][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:54:40,245][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:54:40,966][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:54:41,685][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:54:42,407][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:54:43,128][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:54:43,849][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:54:44,569][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:54:45,289][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:54:46,012][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:54:46,732][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:54:47,452][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:54:48,174][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:54:48,894][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:54:49,615][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:54:50,335][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:54:51,057][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 14:54:52,399][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:54:52,403][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:54:52,408][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:54:53,808][__main__][INFO] - Iteration 29 took 56s (9.44% Gen, 88.07% Train). Generation: 5s, Training: 49s. Estimated remaining time: 15h 8m 27s. Estimated total time: 15h 38m 47s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 52s, 500 more iterations: 7h 49m 23s. [2026-03-25 14:54:53,810][__main__][INFO] - Starting iteration 29. [2026-03-25 14:54:53,816][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:54:53,817][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:54:59,181][__main__][INFO] - Number of regex retries in iteration 29: 0 [2026-03-25 14:54:59,182][__main__][INFO] - agents played in iteration 29 are Bob, Alice [2026-03-25 14:54:59,690][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:54:59,756][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:54:59,758][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:54:59,758][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:55:00,438][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:55:01,086][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:55:01,806][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:55:02,524][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:55:03,243][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:55:03,960][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:55:04,678][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:55:05,395][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:55:06,115][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:55:06,833][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:55:07,550][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:55:08,269][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:55:08,986][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:55:09,707][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:55:10,424][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:55:11,144][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:55:11,862][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:55:12,582][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:55:13,302][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:55:14,020][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:55:14,739][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:55:15,460][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:55:16,178][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:55:16,899][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:55:17,618][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:55:18,336][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:55:19,055][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:55:19,774][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:55:20,495][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:55:21,214][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:55:21,934][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:55:22,654][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:55:23,373][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:55:24,094][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:55:24,813][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:55:25,533][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:55:26,254][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:55:26,972][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:55:27,692][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:55:28,413][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:55:29,132][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:55:29,853][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:55:30,574][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:55:31,293][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:55:32,014][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:55:32,734][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:55:33,455][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:55:34,174][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:55:35,119][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:55:35,842][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:55:36,561][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:55:37,281][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:55:38,002][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:55:38,724][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:55:39,445][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:55:40,165][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:55:40,886][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:55:41,609][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:55:42,329][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:55:43,049][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:55:43,770][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:55:44,491][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:55:45,211][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:55:45,931][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:55:46,653][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:55:47,386][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 14:55:48,491][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:55:48,496][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:55:48,498][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:55:49,852][__main__][INFO] - Iteration 30 took 56s (9.57% Gen, 88.00% Train). Generation: 5s, Training: 49s. Estimated remaining time: 15h 2m 43s. Estimated total time: 15h 33m 59s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 23s, 500 more iterations: 7h 46m 59s. [2026-03-25 14:55:49,855][__main__][INFO] - Starting iteration 30. [2026-03-25 14:55:49,862][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:55:49,863][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:55:55,153][__main__][INFO] - Number of regex retries in iteration 30: 0 [2026-03-25 14:55:55,154][__main__][INFO] - agents played in iteration 30 are Bob, Alice [2026-03-25 14:55:55,621][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:55:55,689][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:55:55,690][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:55:55,691][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:55:56,384][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:55:57,032][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:55:57,752][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:55:58,470][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:55:59,187][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:55:59,906][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:56:00,624][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:56:01,342][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:56:02,059][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:56:02,778][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:56:03,497][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:56:04,215][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:56:04,934][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:56:05,652][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:56:06,370][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:56:07,091][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:56:07,810][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:56:08,529][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:56:09,252][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:56:09,970][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:56:10,688][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:56:11,411][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:56:12,129][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:56:12,849][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:56:13,569][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:56:14,287][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:56:15,008][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:56:15,727][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:56:16,447][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:56:17,167][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:56:17,886][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:56:18,605][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:56:19,325][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:56:20,043][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:56:20,766][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:56:21,485][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:56:22,205][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:56:22,926][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:56:23,645][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:56:24,366][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:56:25,086][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:56:25,805][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:56:26,527][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:56:27,245][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:56:27,967][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:56:28,687][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:56:29,406][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:56:30,128][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:56:31,138][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:56:31,859][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:56:32,578][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:56:33,299][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:56:34,018][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:56:34,737][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:56:35,458][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:56:36,178][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:56:36,897][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:56:37,619][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:56:38,341][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:56:39,061][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:56:39,783][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:56:40,504][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:56:41,225][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:56:41,945][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:56:42,665][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:56:43,394][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 14:56:44,781][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:56:44,786][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:56:44,789][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:56:46,469][__main__][INFO] - Iteration 31 took 56s (9.34% Gen, 87.68% Train). Generation: 5s, Training: 49s. Estimated remaining time: 15h 11m 20s. Estimated total time: 15h 43m 32s. Time estimates for 10 more iterations: 9m 26s, 100 more iterations: 1h 34m 21s, 500 more iterations: 7h 51m 46s. [2026-03-25 14:56:46,473][__main__][INFO] - Starting iteration 31. [2026-03-25 14:56:46,480][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:56:46,482][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:56:51,668][__main__][INFO] - Number of regex retries in iteration 31: 0 [2026-03-25 14:56:51,669][__main__][INFO] - agents played in iteration 31 are Bob, Alice [2026-03-25 14:56:52,131][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:56:52,198][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:56:52,199][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:56:52,200][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:56:52,892][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:56:53,540][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:56:54,259][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:56:54,976][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:56:55,694][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:56:56,413][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:56:57,131][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:56:57,849][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:56:58,568][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:56:59,286][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:57:00,005][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:57:00,722][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:57:01,441][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:57:02,159][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:57:02,877][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:57:03,598][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:57:04,316][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:57:05,034][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:57:05,754][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:57:06,472][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:57:07,192][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:57:07,911][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:57:08,630][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:57:09,351][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:57:10,070][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:57:10,789][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:57:11,509][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:57:12,228][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:57:12,946][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:57:13,667][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:57:14,385][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:57:15,104][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:57:15,823][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:57:16,541][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:57:17,262][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:57:17,981][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:57:18,702][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:57:19,423][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:57:20,144][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:57:20,865][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:57:21,587][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:57:22,307][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:57:23,028][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:57:23,750][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:57:24,469][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:57:25,190][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:57:25,911][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:57:26,632][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:57:27,603][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:57:28,324][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:57:29,046][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:57:29,766][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:57:30,487][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:57:31,208][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:57:31,929][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:57:32,649][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:57:33,369][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:57:34,088][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:57:34,809][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:57:35,529][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:57:36,250][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:57:36,972][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:57:37,691][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:57:38,412][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:57:39,136][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:57:39,869][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 14:57:41,347][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:57:41,350][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:57:41,351][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:57:42,711][__main__][INFO] - Iteration 32 took 56s (9.22% Gen, 88.35% Train). Generation: 5s, Training: 49s. Estimated remaining time: 15h 4m 6s. Estimated total time: 15h 37m 14s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 43s, 500 more iterations: 7h 48m 37s. [2026-03-25 14:57:42,715][__main__][INFO] - Starting iteration 32. [2026-03-25 14:57:42,719][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:57:42,720][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:57:47,928][__main__][INFO] - Number of regex retries in iteration 32: 0 [2026-03-25 14:57:47,929][__main__][INFO] - agents played in iteration 32 are Bob, Alice [2026-03-25 14:57:48,393][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:57:48,458][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:57:48,459][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:57:48,459][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:57:49,145][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:57:49,793][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:57:50,515][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:57:51,232][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:57:51,950][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:57:52,668][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:57:53,387][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:57:54,106][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:57:54,823][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:57:55,541][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:57:56,260][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:57:56,978][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:57:57,697][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:57:58,415][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:57:59,133][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:57:59,854][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:58:00,573][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:58:01,292][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:58:02,011][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:58:02,730][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:58:03,449][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:58:04,169][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:58:04,888][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:58:05,608][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:58:06,329][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:58:07,047][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:58:07,766][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:58:08,487][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:58:09,207][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:58:09,927][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:58:10,647][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:58:11,365][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:58:12,085][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:58:12,805][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:58:13,523][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:58:14,242][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:58:14,962][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:58:15,680][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:58:16,400][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:58:17,120][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:58:17,838][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:58:18,558][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:58:19,277][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:58:19,995][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:58:20,716][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:58:21,436][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:58:22,154][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:58:22,875][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:58:23,821][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:58:24,541][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:58:25,260][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:58:25,980][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:58:26,699][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:58:27,420][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:58:28,140][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:58:28,859][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:58:29,579][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:58:30,300][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:58:31,019][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:58:31,740][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:58:32,461][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:58:33,181][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:58:33,901][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:58:34,620][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:58:35,340][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:58:36,059][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 14:58:37,456][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:58:37,461][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:58:37,463][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:58:39,188][__main__][INFO] - Iteration 33 took 56s (9.23% Gen, 87.72% Train). Generation: 5s, Training: 49s. Estimated remaining time: 15h 7m 5s. Estimated total time: 15h 41m 10s. Time estimates for 10 more iterations: 9m 24s, 100 more iterations: 1h 34m 7s, 500 more iterations: 7h 50m 35s. [2026-03-25 14:58:39,190][__main__][INFO] - Starting iteration 33. [2026-03-25 14:58:39,195][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:58:39,195][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:58:44,456][__main__][INFO] - Number of regex retries in iteration 33: 0 [2026-03-25 14:58:44,457][__main__][INFO] - agents played in iteration 33 are Bob, Alice [2026-03-25 14:58:44,918][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:58:44,984][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:58:44,985][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:58:44,985][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:58:45,693][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:58:46,339][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:58:47,060][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:58:47,776][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:58:48,494][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:58:49,210][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:58:49,928][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:58:50,645][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:58:51,363][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:58:52,080][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:58:52,798][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:58:53,515][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:58:54,233][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:58:54,952][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:58:55,670][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:58:56,391][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:58:57,109][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:58:57,827][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:58:58,545][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:58:59,262][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:58:59,980][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:59:00,698][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:59:01,415][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:59:02,135][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:59:02,853][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:59:03,572][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 14:59:04,291][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 14:59:05,008][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 14:59:05,726][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 14:59:06,445][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 14:59:07,163][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 14:59:07,883][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 14:59:08,600][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 14:59:09,320][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 14:59:10,040][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 14:59:10,759][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 14:59:11,476][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 14:59:12,196][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 14:59:12,915][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 14:59:13,633][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 14:59:14,351][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 14:59:15,070][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 14:59:15,791][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 14:59:16,507][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 14:59:17,228][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 14:59:17,947][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 14:59:18,664][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 14:59:19,384][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 14:59:20,325][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 14:59:21,046][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 14:59:21,764][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 14:59:22,484][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
14:59:23,202][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 14:59:23,922][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 14:59:24,641][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 14:59:25,360][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 14:59:26,080][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 14:59:26,800][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 14:59:27,519][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 14:59:28,239][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 14:59:28,959][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 14:59:29,679][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 14:59:30,398][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 14:59:31,118][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 14:59:31,836][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 14:59:32,567][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 14:59:33,770][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 14:59:33,773][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 14:59:33,776][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 14:59:35,144][__main__][INFO] - Iteration 34 took 55s (9.40% Gen, 88.15% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 57m 30s. Estimated total time: 15h 32m 31s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 15s, 500 more iterations: 7h 46m 15s. [2026-03-25 14:59:35,147][__main__][INFO] - Starting iteration 34. [2026-03-25 14:59:35,151][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 14:59:35,152][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 14:59:40,277][__main__][INFO] - Number of regex retries in iteration 34: 0 [2026-03-25 14:59:40,278][__main__][INFO] - agents played in iteration 34 are Bob, Alice [2026-03-25 14:59:40,752][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:59:40,821][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 14:59:40,822][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 14:59:40,823][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 14:59:41,511][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 14:59:42,160][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 14:59:42,879][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 14:59:43,596][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 14:59:44,312][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 14:59:45,030][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 14:59:45,747][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 14:59:46,465][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 14:59:47,181][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 14:59:47,899][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 14:59:48,616][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 14:59:49,333][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 14:59:50,051][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 14:59:50,767][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 14:59:51,486][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 14:59:52,204][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 14:59:52,922][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 14:59:53,639][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 14:59:54,358][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 14:59:55,077][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 14:59:55,795][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 14:59:56,514][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 14:59:57,232][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 14:59:57,949][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 14:59:58,670][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 14:59:59,387][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:00:00,107][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:00:00,825][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:00:01,545][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:00:02,263][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:00:02,980][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:00:03,699][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:00:04,419][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:00:05,136][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:00:05,856][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:00:06,575][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:00:07,293][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:00:08,013][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:00:08,733][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:00:09,452][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:00:10,172][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:00:10,895][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:00:11,614][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:00:12,335][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:00:13,056][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:00:13,777][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:00:14,497][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:00:15,218][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:00:16,232][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:00:16,953][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:00:17,673][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:00:18,394][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:00:19,112][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:00:19,834][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:00:20,555][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:00:21,274][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:00:21,996][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:00:22,717][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:00:23,437][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:00:24,157][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:00:24,878][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:00:25,600][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:00:26,319][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:00:27,040][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:00:27,760][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:00:28,519][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 15:00:29,523][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:00:29,525][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:00:29,527][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:00:30,774][__main__][INFO] - Iteration 35 took 55s (9.21% Gen, 88.54% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 51m 8s. Estimated total time: 15h 27m 4s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 42s, 500 more iterations: 7h 43m 32s. [2026-03-25 15:00:30,777][__main__][INFO] - Starting iteration 35. [2026-03-25 15:00:30,781][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 15:00:30,782][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:00:35,878][__main__][INFO] - Number of regex retries in iteration 35: 0 [2026-03-25 15:00:35,879][__main__][INFO] - agents played in iteration 35 are Bob, Alice [2026-03-25 15:00:36,348][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:00:36,414][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:00:36,415][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:00:36,415][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:00:37,143][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:00:37,794][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:00:38,512][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:00:39,232][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:00:39,948][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:00:40,667][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:00:41,384][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:00:42,102][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:00:42,819][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:00:43,538][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:00:44,254][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:00:44,973][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:00:45,692][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:00:46,409][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:00:47,127][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:00:47,846][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:00:48,563][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:00:49,282][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:00:50,001][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:00:50,719][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:00:51,438][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:00:52,156][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:00:52,875][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:00:53,594][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:00:54,312][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:00:55,032][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:00:55,749][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:00:56,469][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:00:57,187][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:00:57,908][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:00:58,625][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:00:59,345][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:01:00,065][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:01:00,783][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:01:01,505][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:01:02,224][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:01:02,942][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:01:03,662][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:01:04,381][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:01:05,100][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:01:05,821][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:01:06,538][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:01:07,258][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:01:07,978][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:01:08,698][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:01:09,420][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:01:10,139][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:01:10,860][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:01:11,804][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:01:12,524][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:01:13,243][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:01:13,963][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:01:14,683][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:01:15,402][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:01:16,124][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:01:16,843][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:01:17,562][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:01:18,283][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:01:19,003][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:01:19,722][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:01:20,443][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:01:21,164][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:01:21,883][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:01:22,604][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:01:23,327][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:01:24,067][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 15:01:25,056][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:01:25,059][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:01:25,060][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:01:26,364][__main__][INFO] - Iteration 36 took 55s (9.17% Gen, 88.48% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 49m 32s. Estimated total time: 15h 26m 24s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 38s, 500 more iterations: 7h 43m 12s. [2026-03-25 15:01:26,366][__main__][INFO] - Starting iteration 36. [2026-03-25 15:01:26,372][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 15:01:26,372][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:01:31,510][__main__][INFO] - Number of regex retries in iteration 36: 0 [2026-03-25 15:01:31,511][__main__][INFO] - agents played in iteration 36 are Bob, Alice [2026-03-25 15:01:32,058][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:01:32,126][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:01:32,126][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:01:32,127][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:01:32,846][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:01:33,495][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:01:34,215][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:01:34,932][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:01:35,651][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:01:36,369][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:01:37,087][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:01:37,805][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:01:38,523][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:01:39,244][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:01:39,961][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:01:40,681][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:01:41,400][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:01:42,116][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:01:42,836][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:01:43,556][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:01:44,273][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:01:44,992][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:01:45,710][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:01:46,428][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:01:47,147][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:01:47,865][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:01:48,585][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:01:49,303][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:01:50,023][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:01:50,742][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:01:51,461][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:01:52,181][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:01:52,899][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:01:53,620][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:01:54,340][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:01:55,060][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:01:55,779][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:01:56,499][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:01:57,218][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:01:57,938][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:01:58,659][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:01:59,378][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:02:00,098][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:02:00,817][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:02:01,537][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:02:02,258][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:02:02,977][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:02:03,698][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:02:04,419][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:02:05,139][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:02:05,859][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:02:06,578][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:02:07,524][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:02:08,247][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:02:08,968][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:02:09,691][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:02:10,411][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:02:11,132][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:02:11,854][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:02:12,574][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:02:13,294][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:02:14,014][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:02:14,736][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:02:15,461][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:02:16,183][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:02:16,904][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:02:17,626][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:02:18,349][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:02:19,070][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:02:19,793][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 15:02:20,887][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:02:20,891][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:02:20,892][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:02:22,384][__main__][INFO] - Iteration 37 took 56s (9.17% Gen, 88.16% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 55m 47s. Estimated total time: 15h 33m 35s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 21s, 500 more iterations: 7h 46m 47s. [2026-03-25 15:02:22,387][__main__][INFO] - Starting iteration 37. [2026-03-25 15:02:22,391][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 15:02:22,392][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:02:27,454][__main__][INFO] - Number of regex retries in iteration 37: 0 [2026-03-25 15:02:27,456][__main__][INFO] - agents played in iteration 37 are Bob, Alice [2026-03-25 15:02:27,954][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:02:28,020][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:02:28,021][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:02:28,022][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:02:28,755][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:02:29,403][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:02:30,124][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:02:30,842][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:02:31,562][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:02:32,279][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:02:32,998][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:02:33,717][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:02:34,434][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:02:35,154][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:02:35,873][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:02:36,591][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:02:37,310][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:02:38,029][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:02:38,748][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:02:39,468][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:02:40,187][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:02:40,905][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:02:41,625][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:02:42,345][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:02:43,063][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:02:43,782][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:02:44,504][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:02:45,223][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:02:45,942][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:02:46,663][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:02:47,381][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:02:48,100][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:02:48,821][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:02:49,539][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:02:50,258][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:02:50,980][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:02:51,699][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:02:52,419][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:02:53,141][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:02:53,861][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:02:54,582][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:02:55,302][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:02:56,023][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:02:56,743][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:02:57,464][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:02:58,186][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:02:58,905][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:02:59,626][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:03:00,346][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:03:01,066][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:03:01,787][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:03:02,508][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:03:03,537][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:03:04,259][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:03:04,978][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:03:05,700][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:03:06,421][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:03:07,141][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:03:07,863][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:03:08,583][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:03:09,305][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:03:10,027][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:03:10,746][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:03:11,467][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:03:12,190][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:03:12,909][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:03:13,631][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:03:14,353][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:03:15,075][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:03:15,796][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 15:03:16,999][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:03:17,003][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:03:17,006][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:03:18,255][__main__][INFO] - Iteration 38 took 55s (9.07% Gen, 88.70% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 52m 21s. Estimated total time: 15h 31m 5s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 6s, 500 more iterations: 7h 45m 32s. [2026-03-25 15:03:18,258][__main__][INFO] - Starting iteration 38. [2026-03-25 15:03:18,263][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 15:03:18,265][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:03:23,519][__main__][INFO] - Number of regex retries in iteration 38: 0 [2026-03-25 15:03:23,520][__main__][INFO] - agents played in iteration 38 are Bob, Alice [2026-03-25 15:03:23,999][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:03:24,066][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:03:24,068][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:03:24,099][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:03:24,786][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:03:25,435][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:03:26,155][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:03:26,874][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:03:27,591][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:03:28,311][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:03:29,030][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:03:29,746][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:03:30,467][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:03:31,188][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:03:31,905][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:03:32,625][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:03:33,345][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:03:34,064][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:03:34,783][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:03:35,504][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:03:36,223][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:03:36,941][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:03:37,662][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:03:38,381][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:03:39,101][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:03:39,822][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:03:40,543][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:03:41,264][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:03:41,983][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:03:42,705][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:03:43,424][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:03:44,144][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:03:44,865][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:03:45,585][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:03:46,304][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:03:47,027][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:03:47,747][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:03:48,467][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:03:49,187][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:03:49,908][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:03:50,628][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:03:51,348][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:03:52,069][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:03:52,790][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:03:53,510][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:03:54,231][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:03:54,952][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:03:55,673][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:03:56,394][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:03:57,114][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:03:57,836][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:03:58,557][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:03:59,510][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:04:00,231][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:04:00,952][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:04:01,673][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:04:02,394][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:04:03,114][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:04:03,836][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:04:04,557][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:04:05,277][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:04:05,999][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:04:06,720][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:04:07,442][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:04:08,163][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:04:08,885][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:04:09,608][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:04:10,330][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:04:11,049][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:04:11,768][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 15:04:13,035][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:04:13,039][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:04:13,041][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:04:14,533][__main__][INFO] - Iteration 39 took 56s (9.34% Gen, 88.00% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 58m 11s. Estimated total time: 15h 37m 51s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 47s, 500 more iterations: 7h 48m 55s. [2026-03-25 15:04:14,540][__main__][INFO] - Starting iteration 39. [2026-03-25 15:04:14,547][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 15:04:14,547][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:04:19,613][__main__][INFO] - Number of regex retries in iteration 39: 0 [2026-03-25 15:04:19,614][__main__][INFO] - agents played in iteration 39 are Bob, Alice [2026-03-25 15:04:20,082][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:04:20,151][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:04:20,151][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:04:20,152][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:04:20,845][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:04:21,493][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:04:22,214][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:04:22,933][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:04:23,653][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:04:24,372][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:04:25,091][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:04:25,811][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:04:26,530][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:04:27,253][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:04:27,971][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:04:28,691][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:04:29,411][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:04:30,132][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:04:30,852][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:04:31,572][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:04:32,294][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:04:33,013][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:04:33,735][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:04:34,455][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:04:35,176][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:04:35,896][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:04:36,617][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:04:37,339][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:04:38,060][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:04:38,782][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:04:39,503][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:04:40,224][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:04:40,945][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:04:41,665][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:04:42,386][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:04:43,109][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:04:43,831][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:04:44,553][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:04:45,274][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:04:45,995][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:04:46,716][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:04:47,437][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:04:48,159][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:04:48,879][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:04:49,601][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:04:50,322][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:04:51,042][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:04:51,765][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:04:52,486][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:04:53,209][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:04:53,931][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:04:54,651][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:04:55,603][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:04:56,327][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:04:57,048][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:04:57,771][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:04:58,494][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:04:59,214][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:04:59,935][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:05:00,659][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:05:01,380][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:05:02,103][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:05:02,825][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:05:03,547][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:05:04,269][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:05:04,991][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:05:05,714][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:05:06,436][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:05:07,157][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:05:07,906][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 15:05:08,875][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:05:08,877][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:05:08,879][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:05:10,216][__main__][INFO] - Iteration 40 took 55s (9.10% Gen, 88.49% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 47m 16s. Estimated total time: 15h 27m 52s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 47s, 500 more iterations: 7h 43m 56s. [2026-03-25 15:05:10,219][__main__][INFO] - Starting iteration 40. [2026-03-25 15:05:10,223][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 15:05:10,224][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:05:15,657][__main__][INFO] - Number of regex retries in iteration 40: 0 [2026-03-25 15:05:15,658][__main__][INFO] - agents played in iteration 40 are Bob, Alice [2026-03-25 15:05:16,132][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:05:16,199][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:05:16,200][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:05:16,201][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:05:16,915][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:05:17,566][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:05:18,288][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:05:19,006][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:05:19,726][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:05:20,446][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:05:21,163][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:05:21,884][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:05:22,604][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:05:23,323][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:05:24,043][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:05:24,764][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:05:25,483][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:05:26,202][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:05:26,923][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:05:27,642][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:05:28,361][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:05:29,084][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:05:29,804][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:05:30,523][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:05:31,244][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:05:31,965][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:05:32,685][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:05:33,402][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:05:34,124][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:05:34,844][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:05:35,562][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:05:36,283][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:05:37,002][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:05:37,723][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:05:38,444][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:05:39,164][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:05:39,884][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:05:40,605][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:05:41,324][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:05:42,044][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:05:42,764][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:05:43,483][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:05:44,204][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:05:44,923][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:05:45,644][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:05:46,365][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:05:47,083][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:05:47,803][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:05:48,525][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:05:49,244][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:05:49,963][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:05:50,686][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:05:51,663][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:05:52,385][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:05:53,104][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:05:53,825][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:05:54,545][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:05:55,266][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:05:55,987][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:05:56,706][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:05:57,426][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:05:58,148][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:05:58,870][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:05:59,589][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:06:00,311][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:06:01,033][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:06:01,754][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:06:02,476][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:06:03,195][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:06:03,920][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 15:06:04,848][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:06:04,850][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:06:04,852][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:06:06,172][__main__][INFO] - Iteration 41 took 55s (9.71% Gen, 87.92% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 50m 58s. Estimated total time: 15h 32m 30s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 15s, 500 more iterations: 7h 46m 15s. [2026-03-25 15:06:06,175][__main__][INFO] - Starting iteration 41. [2026-03-25 15:06:06,178][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 15:06:06,179][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:06:11,336][__main__][INFO] - Number of regex retries in iteration 41: 0 [2026-03-25 15:06:11,337][__main__][INFO] - agents played in iteration 41 are Bob, Alice [2026-03-25 15:06:11,817][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:06:11,885][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:06:11,885][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:06:11,886][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:06:12,609][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:06:13,258][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:06:13,979][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:06:14,698][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:06:15,418][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:06:16,137][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:06:16,857][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:06:17,578][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:06:18,296][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:06:19,018][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:06:19,738][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:06:20,457][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:06:21,178][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:06:21,900][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:06:22,619][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:06:23,340][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:06:24,062][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:06:24,782][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:06:25,502][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:06:26,223][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:06:26,943][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:06:27,662][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:06:28,383][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:06:29,103][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:06:29,823][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:06:30,544][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:06:31,263][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:06:31,984][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:06:32,703][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:06:33,425][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:06:34,145][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:06:34,866][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:06:35,585][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:06:36,306][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:06:37,025][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:06:37,746][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:06:38,469][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:06:39,190][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:06:39,910][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:06:40,632][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:06:41,353][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:06:42,073][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:06:42,795][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:06:43,517][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:06:44,239][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:06:44,960][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:06:45,681][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:06:46,402][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:06:47,352][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:06:48,074][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:06:48,795][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:06:49,515][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:06:50,236][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:06:50,959][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:06:51,680][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:06:52,401][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:06:53,122][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:06:53,845][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:06:54,567][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:06:55,288][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:06:56,010][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:06:56,731][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:06:57,455][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:06:58,177][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:06:58,900][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:06:59,626][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 15:07:01,716][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:07:01,721][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:07:01,724][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:07:03,256][__main__][INFO] - Iteration 42 took 57s (9.04% Gen, 88.28% Train). Generation: 5s, Training: 50s. Estimated remaining time: 15h 8m 50s. Estimated total time: 15h 51m 19s. Time estimates for 10 more iterations: 9m 30s, 100 more iterations: 1h 35m 7s, 500 more iterations: 7h 55m 39s. [2026-03-25 15:07:03,260][__main__][INFO] - Starting iteration 42. [2026-03-25 15:07:03,265][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 15:07:03,266][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:07:08,543][__main__][INFO] - Number of regex retries in iteration 42: 0 [2026-03-25 15:07:08,544][__main__][INFO] - agents played in iteration 42 are Bob, Alice [2026-03-25 15:07:09,009][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:07:09,079][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:07:09,080][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:07:09,081][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:07:09,768][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:07:10,416][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:07:11,137][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:07:11,855][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:07:12,573][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:07:13,291][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:07:14,010][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:07:14,729][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:07:15,448][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:07:16,167][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:07:16,885][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:07:17,605][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:07:18,323][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:07:19,042][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:07:19,764][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:07:20,482][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:07:21,202][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:07:21,922][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:07:22,640][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:07:23,362][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:07:24,082][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:07:24,801][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:07:25,520][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:07:26,240][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:07:26,959][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:07:27,680][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:07:28,399][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:07:29,119][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:07:29,840][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:07:30,558][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:07:31,279][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:07:32,000][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:07:32,719][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:07:33,439][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:07:34,161][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:07:34,880][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:07:35,601][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:07:36,322][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:07:37,041][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:07:37,763][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:07:38,484][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:07:39,207][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:07:39,928][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:07:40,650][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:07:41,372][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:07:42,093][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:07:42,815][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:07:43,537][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:07:44,559][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:07:45,281][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:07:46,001][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:07:46,724][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:07:47,445][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:07:48,167][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:07:48,890][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:07:49,612][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:07:50,334][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:07:51,055][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:07:51,777][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:07:52,499][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:07:53,221][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:07:53,941][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:07:54,662][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:07:55,385][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:07:56,107][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:07:56,839][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 15:07:58,001][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:07:58,005][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:07:58,007][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:07:59,269][__main__][INFO] - Iteration 43 took 56s (9.43% Gen, 88.32% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 50m 0s. Estimated total time: 15h 33m 25s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 20s, 500 more iterations: 7h 46m 42s. [2026-03-25 15:07:59,273][__main__][INFO] - Starting iteration 43. [2026-03-25 15:07:59,279][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 15:07:59,280][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:08:04,455][__main__][INFO] - Number of regex retries in iteration 43: 0 [2026-03-25 15:08:04,457][__main__][INFO] - agents played in iteration 43 are Bob, Alice [2026-03-25 15:08:05,020][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:08:05,090][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:08:05,091][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:08:05,092][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:08:05,774][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:08:06,421][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:08:07,145][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:08:07,866][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:08:08,584][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:08:09,304][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:08:10,023][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:08:10,742][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:08:11,462][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:08:12,181][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:08:12,900][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:08:13,620][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:08:14,340][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:08:15,058][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:08:15,779][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:08:16,499][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:08:17,217][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:08:17,938][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:08:18,659][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:08:19,377][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:08:20,098][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:08:20,818][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:08:21,537][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:08:22,258][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:08:22,979][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:08:23,699][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:08:24,419][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:08:25,141][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:08:25,862][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:08:26,581][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:08:27,302][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:08:28,023][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:08:28,743][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:08:29,463][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:08:30,183][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:08:30,904][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:08:31,625][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:08:32,345][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:08:33,066][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:08:33,786][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:08:34,506][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:08:35,228][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:08:35,948][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:08:36,669][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:08:37,391][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:08:38,110][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:08:38,832][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:08:39,554][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:08:40,529][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:08:41,250][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:08:41,971][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:08:42,691][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:08:43,413][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:08:44,135][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:08:44,856][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:08:45,576][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:08:46,298][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:08:47,019][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:08:47,741][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:08:48,464][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:08:49,184][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:08:49,905][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:08:50,628][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:08:51,349][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:08:52,071][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:08:52,787][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 15:08:53,741][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:08:53,743][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:08:53,745][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:08:55,047][__main__][INFO] - Iteration 44 took 55s (9.28% Gen, 88.38% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 45m 9s. Estimated total time: 15h 29m 30s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 57s, 500 more iterations: 7h 44m 45s. [2026-03-25 15:08:55,051][__main__][INFO] - Starting iteration 44. [2026-03-25 15:08:55,056][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 15:08:55,056][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:09:00,145][__main__][INFO] - Number of regex retries in iteration 44: 0 [2026-03-25 15:09:00,146][__main__][INFO] - agents played in iteration 44 are Bob, Alice [2026-03-25 15:09:00,656][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:09:00,723][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:09:00,723][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:09:00,724][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:09:01,450][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:09:02,100][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:09:02,820][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:09:03,539][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:09:04,256][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:09:04,975][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:09:05,695][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:09:06,412][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:09:07,132][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:09:07,853][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:09:08,572][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:09:09,292][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:09:10,013][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:09:10,730][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:09:11,450][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:09:12,171][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:09:12,889][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:09:13,608][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:09:14,330][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:09:15,047][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:09:15,768][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:09:16,490][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:09:17,210][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:09:17,930][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:09:18,651][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:09:19,374][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:09:20,096][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:09:20,816][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:09:21,536][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:09:22,256][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:09:22,976][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:09:23,696][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:09:24,417][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:09:25,137][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:09:25,856][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:09:26,576][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:09:27,298][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:09:28,017][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:09:28,737][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:09:29,459][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:09:30,180][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:09:30,900][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:09:31,622][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:09:32,343][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:09:33,064][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:09:33,784][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:09:34,505][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:09:35,226][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:09:36,169][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:09:36,891][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:09:37,610][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:09:38,332][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:09:39,054][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:09:39,775][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:09:40,496][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:09:41,218][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:09:41,937][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:09:42,659][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:09:43,380][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:09:44,099][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:09:44,822][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:09:45,542][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:09:46,264][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:09:46,985][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:09:47,706][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:09:48,434][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 15:09:49,449][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:09:49,454][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:09:49,456][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:09:51,253][__main__][INFO] - Iteration 45 took 56s (9.06% Gen, 87.74% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 51m 22s. Estimated total time: 15h 36m 38s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 39s, 500 more iterations: 7h 48m 19s. [2026-03-25 15:09:51,255][__main__][INFO] - Starting iteration 45. [2026-03-25 15:09:51,259][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2026-03-25 15:09:51,260][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:09:56,349][__main__][INFO] - Number of regex retries in iteration 45: 0 [2026-03-25 15:09:56,350][__main__][INFO] - agents played in iteration 45 are Bob, Alice [2026-03-25 15:09:56,820][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:09:56,894][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:09:56,895][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:09:56,895][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:09:57,578][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:09:58,226][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:09:58,945][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:09:59,663][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:10:00,379][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:10:01,097][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:10:01,815][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:10:02,533][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:10:03,251][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:10:03,970][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:10:04,688][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:10:05,406][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:10:06,124][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:10:06,844][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:10:07,561][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:10:08,281][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:10:09,000][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:10:09,718][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:10:10,438][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:10:11,157][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:10:11,875][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:10:12,594][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:10:13,315][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:10:14,032][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:10:14,752][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:10:15,471][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:10:16,189][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:10:16,909][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:10:17,629][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:10:18,349][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:10:19,069][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:10:19,790][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:10:20,508][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:10:21,226][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:10:21,947][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:10:22,666][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:10:23,385][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:10:24,105][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:10:24,825][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:10:25,545][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:10:26,265][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:10:26,984][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:10:27,704][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:10:28,423][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:10:29,144][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:10:29,864][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:10:30,584][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:10:31,305][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:10:32,322][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:10:33,042][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:10:33,763][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:10:34,483][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:10:35,203][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:10:35,924][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:10:36,646][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:10:37,364][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:10:38,086][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:10:38,808][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:10:39,528][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:10:40,249][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:10:40,972][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:10:41,693][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:10:42,414][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:10:43,133][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:10:43,854][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:10:44,599][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47
[2026-03-25 15:10:46,073][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt
[2026-03-25 15:10:46,077][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt
[2026-03-25 15:10:46,079][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl
[2026-03-25 15:10:47,598][__main__][INFO] - Iteration 46 took 56s (9.03% Gen, 88.26% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 52m 47s. Estimated total time: 15h 39m 0s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 54s, 500 more iterations: 7h 49m 30s.
[2026-03-25 15:10:47,601][__main__][INFO] - Starting iteration 46.
[2026-03-25 15:10:47,606][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1.
[2026-03-25 15:10:47,607][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:10:52,714][__main__][INFO] - Number of regex retries in iteration 46: 0 [2026-03-25 15:10:52,715][__main__][INFO] - agents played in iteration 46 are Bob, Alice [2026-03-25 15:10:53,187][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:10:53,257][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:10:53,257][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:10:53,258][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:10:53,940][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:10:54,588][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:10:55,306][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:10:56,023][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:10:56,740][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:10:57,459][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:10:58,176][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:10:58,894][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:10:59,611][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:11:00,329][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:11:01,046][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:11:01,765][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:11:02,482][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:11:03,202][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:11:03,919][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:11:04,638][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:11:05,356][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:11:06,073][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:11:06,793][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:11:07,512][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:11:08,229][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:11:08,950][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:11:09,668][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:11:10,387][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:11:11,105][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:11:11,824][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:11:12,544][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:11:13,262][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:11:13,980][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:11:14,701][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:11:15,418][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:11:16,139][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:11:16,859][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:11:17,577][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:11:18,298][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:11:19,018][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:11:19,738][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:11:20,458][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:11:21,179][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:11:21,897][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:11:22,617][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:11:23,337][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:11:24,056][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:11:24,776][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:11:25,497][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:11:26,219][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:11:26,937][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:11:27,659][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:11:28,600][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:11:29,322][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:11:30,040][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:11:30,761][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:11:31,483][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:11:32,201][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:11:32,922][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:11:33,644][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:11:34,364][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:11:35,084][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:11:35,804][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:11:36,525][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:11:37,245][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:11:37,964][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:11:38,686][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:11:39,409][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:11:40,131][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:11:40,851][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46
[2026-03-25 15:11:41,988][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt
[2026-03-25 15:11:41,993][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt
[2026-03-25 15:11:41,995][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl
[2026-03-25 15:11:43,322][__main__][INFO] - Iteration 47 took 55s (9.17% Gen, 88.44% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 41m 29s. Estimated total time: 15h 28m 38s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 51s, 500 more iterations: 7h 44m 19s.
[2026-03-25 15:11:43,325][__main__][INFO] - Starting iteration 47.
[2026-03-25 15:11:43,329][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1.
[2026-03-25 15:11:43,330][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:11:48,435][__main__][INFO] - Number of regex retries in iteration 47: 0 [2026-03-25 15:11:48,436][__main__][INFO] - agents played in iteration 47 are Bob, Alice [2026-03-25 15:11:48,910][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:11:48,978][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:11:48,979][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:11:48,980][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:11:49,666][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:11:50,315][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:11:51,034][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:11:51,752][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:11:52,467][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:11:53,186][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:11:53,903][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:11:54,623][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:11:55,341][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:11:56,058][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:11:56,775][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:11:57,494][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:11:58,215][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:11:58,934][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:11:59,654][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:12:00,372][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:12:01,093][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:12:01,812][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:12:02,531][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:12:03,251][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:12:03,972][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:12:04,692][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:12:05,412][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:12:06,131][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:12:06,851][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:12:07,571][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:12:08,291][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:12:09,013][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:12:09,734][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:12:10,455][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:12:11,175][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:12:11,894][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:12:12,615][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:12:13,336][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:12:14,056][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:12:14,776][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:12:15,497][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:12:16,218][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:12:16,937][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:12:17,659][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:12:18,379][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:12:19,099][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:12:19,821][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:12:20,541][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:12:21,261][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:12:21,981][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:12:22,703][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:12:23,424][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:12:24,372][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:12:25,095][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:12:25,813][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:12:26,534][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:12:27,256][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:12:27,977][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:12:28,697][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:12:29,418][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:12:30,138][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:12:30,860][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:12:31,580][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:12:32,301][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:12:33,024][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:12:33,745][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:12:34,467][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:12:35,187][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:12:35,909][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:12:36,653][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46
[2026-03-25 15:12:37,702][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt
[2026-03-25 15:12:37,707][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt
[2026-03-25 15:12:37,709][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl
[2026-03-25 15:12:39,111][__main__][INFO] - Iteration 48 took 55s (9.15% Gen, 88.33% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 41m 39s. Estimated total time: 15h 29m 44s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 58s, 500 more iterations: 7h 44m 52s.
[2026-03-25 15:12:39,114][__main__][INFO] - Starting iteration 48.
[2026-03-25 15:12:39,118][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1.
[2026-03-25 15:12:39,119][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:12:44,486][__main__][INFO] - Number of regex retries in iteration 48: 0 [2026-03-25 15:12:44,487][__main__][INFO] - agents played in iteration 48 are Bob, Alice [2026-03-25 15:12:44,954][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:12:45,022][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:12:45,023][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:12:45,023][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:12:45,706][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:12:46,354][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:12:47,073][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:12:47,789][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:12:48,507][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:12:49,224][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:12:49,942][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:12:50,660][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:12:51,379][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:12:52,097][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:12:52,814][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:12:53,532][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:12:54,250][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:12:54,969][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:12:55,687][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:12:56,405][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:12:57,125][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:12:57,843][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:12:58,563][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:12:59,281][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:13:00,000][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:13:00,720][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:13:01,437][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:13:02,158][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:13:02,877][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:13:03,595][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:13:04,316][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:13:05,034][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:13:05,754][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:13:06,474][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:13:07,193][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:13:07,912][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:13:08,632][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:13:09,351][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:13:10,071][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:13:10,790][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:13:11,507][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:13:12,229][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:13:12,948][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:13:13,668][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:13:14,387][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:13:15,106][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:13:15,828][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:13:16,546][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:13:17,267][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:13:17,987][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:13:18,706][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:13:19,426][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:13:20,408][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:13:21,128][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:13:21,848][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:13:22,567][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:13:23,287][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:13:24,008][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:13:24,727][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:13:25,447][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:13:26,169][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:13:26,889][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:13:27,608][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:13:28,328][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:13:29,048][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:13:29,768][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:13:30,489][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:13:31,209][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:13:31,930][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:13:32,654][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46
[2026-03-25 15:13:33,804][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt
[2026-03-25 15:13:33,810][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt
[2026-03-25 15:13:33,813][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl
[2026-03-25 15:13:35,074][__main__][INFO] - Iteration 49 took 55s (9.59% Gen, 88.15% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 43m 37s. Estimated total time: 15h 32m 38s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 15s, 500 more iterations: 7h 46m 19s.
[2026-03-25 15:13:35,077][__main__][INFO] - Starting iteration 49.
[2026-03-25 15:13:35,081][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1.
[2026-03-25 15:13:35,082][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:13:40,219][__main__][INFO] - Number of regex retries in iteration 49: 0 [2026-03-25 15:13:40,220][__main__][INFO] - agents played in iteration 49 are Bob, Alice [2026-03-25 15:13:40,685][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:13:40,753][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:13:40,754][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:13:40,755][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:13:41,444][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:13:42,092][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:13:42,812][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:13:43,529][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:13:44,245][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:13:44,964][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:13:45,681][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:13:46,400][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:13:47,116][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:13:47,836][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:13:48,553][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:13:49,273][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:13:49,991][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:13:50,709][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:13:51,432][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:13:52,151][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:13:52,871][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:13:53,591][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:13:54,311][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:13:55,031][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:13:55,751][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:13:56,472][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:13:57,192][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:13:57,911][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:13:58,632][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:13:59,354][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:14:00,073][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:14:00,795][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:14:01,516][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:14:02,237][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:14:02,956][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:14:03,677][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:14:04,398][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:14:05,118][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:14:05,839][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:14:06,561][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:14:07,281][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:14:08,001][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:14:08,722][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:14:09,444][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:14:10,164][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:14:10,884][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:14:11,606][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:14:12,328][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:14:13,048][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:14:13,770][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:14:14,491][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:14:15,211][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:14:16,168][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:14:16,889][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:14:17,610][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:14:18,332][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:14:19,053][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:14:19,773][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:14:20,496][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:14:21,216][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:14:21,937][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:14:22,659][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:14:23,382][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:14:24,104][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:14:24,824][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:14:25,546][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:14:26,266][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:14:26,988][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:14:27,709][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:14:28,457][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47
[2026-03-25 15:14:29,520][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt
[2026-03-25 15:14:29,523][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt
[2026-03-25 15:14:29,524][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl
[2026-03-25 15:14:30,891][__main__][INFO] - Iteration 50 took 55s (9.21% Gen, 88.34% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 40m 14s. Estimated total time: 15h 30m 11s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 1s, 500 more iterations: 7h 45m 5s.
[2026-03-25 15:14:30,893][__main__][INFO] - Starting iteration 50.
[2026-03-25 15:14:30,897][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1.
[2026-03-25 15:14:30,897][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:14:35,987][__main__][INFO] - Number of regex retries in iteration 50: 0 [2026-03-25 15:14:35,988][__main__][INFO] - agents played in iteration 50 are Bob, Alice [2026-03-25 15:14:36,459][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:14:36,527][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:14:36,528][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:14:36,528][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:14:37,212][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:14:37,859][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:14:38,580][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:14:39,298][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:14:40,017][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:14:40,736][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:14:41,453][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:14:42,173][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:14:42,890][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:14:43,609][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:14:44,329][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:14:45,047][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:14:45,766][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:14:46,485][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:14:47,203][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:14:47,922][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:14:48,641][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:14:49,359][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:14:50,079][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:14:50,798][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:14:51,517][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:14:52,238][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:14:52,957][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:14:53,675][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:14:54,397][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:14:55,116][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:14:55,835][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:14:56,556][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:14:57,275][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:14:57,996][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:14:58,715][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:14:59,435][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:15:00,154][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:15:00,874][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:15:01,596][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:15:02,315][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:15:03,035][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:15:03,757][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:15:04,476][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:15:05,196][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:15:05,917][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:15:06,638][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:15:07,358][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:15:08,079][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:15:08,801][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:15:09,523][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:15:10,243][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:15:10,964][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:15:11,926][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:15:12,648][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:15:13,367][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:15:14,087][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:15:14,810][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:15:15,531][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:15:16,251][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:15:16,973][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:15:17,695][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:15:18,415][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:15:19,135][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:15:19,858][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:15:20,579][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:15:21,299][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:15:22,021][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:15:22,743][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:15:23,465][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:15:24,250][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47
[2026-03-25 15:15:25,307][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt
[2026-03-25 15:15:25,311][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt
[2026-03-25 15:15:25,313][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl
[2026-03-25 15:15:27,873][__main__][INFO] - Iteration 51 took 56s (8.93% Gen, 86.57% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 58m 45s. Estimated total time: 15h 49m 38s. Time estimates for 10 more iterations: 9m 29s, 100 more iterations: 1h 34m 57s, 500 more iterations: 7h 54m 49s.
[2026-03-25 15:15:27,877][__main__][INFO] - Starting iteration 51.
[2026-03-25 15:15:27,882][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1.
[2026-03-25 15:15:27,882][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:15:33,329][__main__][INFO] - Number of regex retries in iteration 51: 0 [2026-03-25 15:15:33,330][__main__][INFO] - agents played in iteration 51 are Bob, Alice [2026-03-25 15:15:33,872][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:15:33,941][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:15:33,942][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:15:33,943][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:15:34,631][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:15:35,279][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:15:35,998][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:15:36,716][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:15:37,435][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:15:38,153][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:15:38,873][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:15:39,591][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:15:40,310][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:15:41,027][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:15:41,746][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:15:42,465][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:15:43,184][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:15:43,902][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:15:44,621][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:15:45,341][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:15:46,059][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:15:46,778][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:15:47,497][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:15:48,215][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:15:48,935][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:15:49,654][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:15:50,373][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:15:51,092][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:15:51,814][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:15:52,532][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:15:53,252][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:15:53,972][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:15:54,690][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:15:55,410][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:15:56,131][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:15:56,850][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:15:57,570][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:15:58,291][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:15:59,010][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:15:59,730][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:16:00,452][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:16:01,172][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:16:01,893][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:16:02,613][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:16:03,334][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:16:04,056][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:16:04,774][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:16:05,495][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:16:06,217][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:16:06,937][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:16:07,657][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:16:08,378][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:16:09,352][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:16:10,073][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:16:10,794][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:16:11,515][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:16:12,235][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:16:12,955][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:16:13,678][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:16:14,400][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:16:15,118][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:16:15,840][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:16:16,562][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:16:17,283][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:16:18,004][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:16:18,726][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:16:19,447][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:16:20,169][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:16:20,889][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:16:21,617][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46
[2026-03-25 15:16:22,690][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt
[2026-03-25 15:16:22,693][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt
[2026-03-25 15:16:22,697][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl
[2026-03-25 15:16:23,981][__main__][INFO] - Iteration 52 took 56s (9.71% Gen, 88.00% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 43m 11s. Estimated total time: 15h 35m 1s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 30s, 500 more iterations: 7h 47m 30s.
[2026-03-25 15:16:23,983][__main__][INFO] - Starting iteration 52.
[2026-03-25 15:16:23,987][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1.
[2026-03-25 15:16:23,988][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:16:34,905][__main__][INFO] - Number of regex retries in iteration 52: 0 [2026-03-25 15:16:34,906][__main__][INFO] - agents played in iteration 52 are Bob, Alice [2026-03-25 15:16:35,419][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:16:35,486][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:16:35,487][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:16:35,488][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:16:36,182][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:16:36,828][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:16:37,546][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:16:38,264][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:16:38,980][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:16:39,697][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:16:40,412][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:16:41,129][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:16:41,845][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:16:42,562][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:16:43,277][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:16:43,993][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:16:44,711][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:16:45,427][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:16:46,144][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:16:46,862][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:16:47,580][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:16:48,299][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:16:49,018][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:16:49,735][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:16:50,455][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:16:51,173][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:16:51,892][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:16:52,612][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:16:53,331][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:16:54,050][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:16:54,769][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:16:55,488][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:16:56,207][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:16:56,925][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:16:57,644][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:16:58,364][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:16:59,082][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:16:59,803][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:17:00,522][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:17:01,241][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:17:01,960][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:17:02,681][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:17:03,402][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:17:04,120][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:17:04,842][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:17:05,561][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:17:06,280][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:17:06,999][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:17:07,720][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:17:08,440][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:17:09,161][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:17:09,883][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:17:10,829][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:17:11,553][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:17:12,272][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:17:12,991][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:17:13,711][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:17:14,431][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:17:15,149][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:17:15,870][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:17:16,589][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:17:17,307][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:17:18,029][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:17:18,748][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:17:19,468][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:17:20,188][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:17:20,907][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:17:21,626][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:17:22,347][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:17:23,075][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 15:17:24,625][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:17:24,630][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:17:24,632][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:17:26,062][__main__][INFO] - Iteration 53 took 1m 2s (17.59% Gen, 80.11% Train). Generation: 10s, Training: 49s. Estimated remaining time: 16h 21m 44s. Estimated total time: 17h 14m 36s. Time estimates for 10 more iterations: 10m 20s, 100 more iterations: 1h 43m 27s, 500 more iterations: 8h 37m 18s. [2026-03-25 15:17:26,065][__main__][INFO] - Starting iteration 53. [2026-03-25 15:17:26,071][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:17:26,072][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:17:31,433][__main__][INFO] - Number of regex retries in iteration 53: 0 [2026-03-25 15:17:31,434][__main__][INFO] - agents played in iteration 53 are Bob, Alice [2026-03-25 15:17:31,926][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:17:31,994][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:17:31,995][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:17:31,996][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:17:32,676][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:17:33,324][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:17:34,043][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:17:34,758][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:17:35,475][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:17:36,192][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:17:36,908][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:17:37,625][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:17:38,342][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:17:39,062][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:17:39,779][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:17:40,498][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:17:41,216][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:17:41,934][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:17:42,652][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:17:43,371][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:17:44,089][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:17:44,808][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:17:45,525][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:17:46,246][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:17:46,965][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:17:47,682][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:17:48,401][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:17:49,121][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:17:49,840][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:17:50,558][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:17:51,278][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:17:51,998][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:17:52,717][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:17:53,436][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:17:54,155][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:17:54,875][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:17:55,596][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:17:56,316][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:17:57,034][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:17:57,753][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:17:58,473][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:17:59,192][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:17:59,912][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:18:00,632][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:18:01,352][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:18:02,073][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:18:02,793][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:18:03,513][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:18:04,233][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:18:04,954][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:18:05,675][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:18:06,395][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:18:07,400][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:18:08,123][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:18:08,843][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:18:09,563][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:18:10,285][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:18:11,005][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:18:11,725][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:18:12,446][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:18:13,166][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:18:13,887][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:18:14,607][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:18:15,326][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:18:16,050][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:18:16,769][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:18:17,489][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:18:18,210][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:18:18,932][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:18:19,690][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 15:18:20,944][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:18:20,947][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:18:20,949][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:18:22,400][__main__][INFO] - Iteration 54 took 56s (9.52% Gen, 87.90% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 45m 3s. Estimated total time: 15h 38m 52s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 53s, 500 more iterations: 7h 49m 26s. [2026-03-25 15:18:22,404][__main__][INFO] - Starting iteration 54. [2026-03-25 15:18:22,411][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:18:22,412][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:18:29,170][__main__][INFO] - Number of regex retries in iteration 54: 0 [2026-03-25 15:18:29,171][__main__][INFO] - agents played in iteration 54 are Bob, Alice [2026-03-25 15:18:29,637][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:18:29,705][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:18:29,706][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:18:29,706][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:18:30,424][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:18:31,072][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:18:31,791][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:18:32,509][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:18:33,895][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:18:36,566][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:18:37,284][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:18:38,999][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:18:38,717][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:18:39,434][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:18:40,152][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:18:40,869][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:18:41,587][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:18:42,304][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:18:43,020][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:18:43,737][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:18:44,455][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:18:45,171][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:18:45,889][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:18:46,606][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:18:47,324][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:18:48,041][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:18:48,759][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:18:49,475][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:18:50,194][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:18:50,910][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:18:51,630][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:18:52,346][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:18:53,066][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:18:53,783][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:18:54,503][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:18:55,221][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:18:55,939][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:18:56,657][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:18:57,376][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:18:58,095][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:18:58,811][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:18:59,531][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:19:00,250][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:19:00,967][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:19:01,688][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:19:02,406][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:19:03,125][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:19:03,843][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:19:04,561][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:19:05,281][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:19:06,001][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:19:06,719][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:19:07,671][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:19:08,390][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:19:09,110][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:19:09,829][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:19:10,548][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:19:11,267][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:19:11,986][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:19:12,705][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:19:13,426][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:19:14,144][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:19:14,863][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:19:15,584][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:19:16,302][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:19:17,022][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:19:17,743][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:19:18,461][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:19:19,181][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:19:19,901][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:49 [2026-03-25 15:19:20,856][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:19:20,859][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:19:20,860][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:19:22,127][__main__][INFO] - Iteration 55 took 59s (11.32% Gen, 86.55% Train). Generation: 6s, Training: 51s. Estimated remaining time: 15h 40m 31s. Estimated total time: 16h 35m 19s. Time estimates for 10 more iterations: 9m 57s, 100 more iterations: 1h 39m 31s, 500 more iterations: 8h 17m 39s. [2026-03-25 15:19:22,130][__main__][INFO] - Starting iteration 55. [2026-03-25 15:19:22,134][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:19:22,135][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:19:27,217][__main__][INFO] - Number of regex retries in iteration 55: 0 [2026-03-25 15:19:27,218][__main__][INFO] - agents played in iteration 55 are Bob, Alice [2026-03-25 15:19:27,684][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:19:27,752][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:19:27,753][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:19:27,754][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:19:28,436][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:19:29,083][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:19:29,802][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:19:30,518][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:19:31,236][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:19:31,952][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:19:32,669][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:19:33,387][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:19:34,105][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:19:34,825][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:19:35,542][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:19:36,260][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:19:36,978][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:19:37,697][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:19:38,414][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:19:39,134][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:19:39,853][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:19:40,569][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:19:41,291][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:19:42,008][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:19:42,726][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:19:43,445][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:19:44,163][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:19:44,882][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:19:45,600][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:19:46,320][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:19:47,038][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:19:47,758][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:19:48,478][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:19:49,197][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:19:49,916][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:19:50,635][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:19:51,353][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:19:52,073][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:19:52,791][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:19:53,512][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:19:54,230][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:19:54,950][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:19:55,669][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:19:56,388][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:19:57,108][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:19:57,829][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:19:58,546][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:19:59,267][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:19:59,987][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:20:00,707][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:20:01,426][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:20:02,146][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:20:03,092][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:20:03,810][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:20:04,530][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:20:05,250][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:20:05,969][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:20:06,691][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:20:07,411][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:20:08,130][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:20:08,851][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:20:09,572][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:20:10,291][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:20:11,012][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:20:11,733][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:20:12,453][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:20:13,172][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:20:13,893][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:20:14,615][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:20:15,348][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 15:20:16,428][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:20:16,432][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:20:16,434][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:20:17,796][__main__][INFO] - Iteration 56 took 55s (9.13% Gen, 88.42% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 32m 1s. Estimated total time: 15h 27m 44s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 46s, 500 more iterations: 7h 43m 52s. [2026-03-25 15:20:17,799][__main__][INFO] - Starting iteration 56. [2026-03-25 15:20:17,803][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:20:17,804][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:20:22,963][__main__][INFO] - Number of regex retries in iteration 56: 0 [2026-03-25 15:20:22,964][__main__][INFO] - agents played in iteration 56 are Bob, Alice [2026-03-25 15:20:23,435][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:20:23,503][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:20:23,504][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:20:23,505][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:20:24,197][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:20:24,844][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:20:25,564][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:20:26,282][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:20:26,999][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:20:27,716][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:20:28,434][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:20:29,152][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:20:29,871][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:20:30,588][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:20:31,306][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:20:32,024][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:20:32,743][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:20:33,463][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:20:34,180][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:20:34,899][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:20:35,618][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:20:36,337][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:20:37,056][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:20:37,773][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:20:38,493][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:20:39,213][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:20:39,931][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:20:40,651][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:20:41,372][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:20:42,091][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:20:42,809][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:20:43,529][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:20:44,246][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:20:44,966][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:20:45,685][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:20:46,403][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:20:47,123][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:20:47,843][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:20:48,561][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:20:49,282][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:20:50,002][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:20:50,719][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:20:51,441][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:20:52,158][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:20:52,879][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:20:53,598][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:20:54,317][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:20:55,037][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:20:55,756][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:20:56,476][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:20:57,197][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:20:57,915][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:20:58,866][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:20:59,585][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:21:00,305][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:21:01,023][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:21:01,744][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:21:02,463][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:21:03,185][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:21:03,904][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:21:04,623][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:21:05,344][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:21:06,064][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:21:06,784][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:21:07,504][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:21:08,224][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:21:08,945][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:21:09,665][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:21:10,385][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:21:11,124][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 15:21:12,098][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:21:12,100][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:21:12,102][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:21:13,416][__main__][INFO] - Iteration 57 took 55s (9.28% Gen, 88.35% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 30m 15s. Estimated total time: 15h 26m 54s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 41s, 500 more iterations: 7h 43m 27s. [2026-03-25 15:21:13,418][__main__][INFO] - Starting iteration 57. [2026-03-25 15:21:13,422][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:21:13,423][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:21:18,647][__main__][INFO] - Number of regex retries in iteration 57: 0 [2026-03-25 15:21:18,648][__main__][INFO] - agents played in iteration 57 are Bob, Alice [2026-03-25 15:21:19,119][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:21:19,186][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:21:19,187][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:21:19,188][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:21:19,879][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:21:20,527][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:21:21,244][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:21:21,963][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:21:22,680][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:21:23,399][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:21:24,116][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:21:24,834][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:21:25,550][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:21:26,269][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:21:26,987][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:21:27,704][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:21:28,424][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:21:29,140][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:21:29,859][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:21:30,577][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:21:31,294][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:21:32,014][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:21:32,731][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:21:33,450][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:21:34,169][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:21:34,887][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:21:35,606][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:21:36,324][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:21:37,042][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:21:37,762][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:21:38,479][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:21:39,199][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:21:39,917][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:21:40,635][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:21:41,353][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:21:42,072][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:21:42,792][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:21:43,509][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:21:44,227][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:21:44,948][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:21:45,665][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:21:46,385][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:21:47,105][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:21:47,823][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:21:48,542][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:21:49,262][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:21:49,980][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:21:50,700][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:21:51,420][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:21:52,138][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:21:52,859][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:21:53,578][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:21:54,556][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:21:55,277][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:21:55,994][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:21:56,714][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:21:57,435][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:21:58,154][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:21:58,874][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:21:59,593][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:22:00,312][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:22:01,033][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:22:01,752][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:22:02,470][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:22:03,191][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:22:03,910][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:22:04,628][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:22:05,349][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:22:06,067][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:22:06,809][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 15:22:08,019][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:22:08,023][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:22:08,025][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:22:09,311][__main__][INFO] - Iteration 58 took 55s (9.35% Gen, 88.35% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 33m 55s. Estimated total time: 15h 31m 30s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 9s, 500 more iterations: 7h 45m 45s. [2026-03-25 15:22:09,313][__main__][INFO] - Starting iteration 58. [2026-03-25 15:22:09,324][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:22:09,325][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:22:14,464][__main__][INFO] - Number of regex retries in iteration 58: 0 [2026-03-25 15:22:14,466][__main__][INFO] - agents played in iteration 58 are Bob, Alice [2026-03-25 15:22:14,973][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:22:15,042][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:22:15,043][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:22:15,044][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:22:15,766][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:22:16,415][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:22:17,132][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:22:17,850][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:22:18,567][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:22:19,285][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:22:20,001][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:22:20,720][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:22:21,436][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:22:22,154][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:22:22,872][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:22:23,590][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:22:24,309][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:22:25,025][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:22:25,744][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:22:26,460][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:22:27,178][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:22:27,896][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:22:28,614][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:22:29,332][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:22:30,049][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:22:30,768][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:22:31,486][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:22:32,204][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:22:32,923][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:22:33,639][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:22:34,360][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:22:35,080][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:22:35,796][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:22:36,516][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:22:37,234][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:22:37,953][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:22:38,672][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:22:39,392][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:22:40,110][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:22:40,829][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:22:41,547][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:22:42,266][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:22:42,984][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:22:43,703][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:22:44,422][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:22:45,140][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:22:45,859][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:22:46,579][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:22:47,296][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:22:48,014][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:22:48,734][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:22:49,451][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:22:50,394][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:22:51,115][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:22:51,835][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:22:52,556][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:22:53,276][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:22:53,996][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:22:54,714][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:22:55,434][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:22:56,156][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:22:56,875][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:22:57,595][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:22:58,316][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:22:59,036][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:22:59,755][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:23:00,476][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:23:01,197][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:23:01,917][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:23:02,651][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 15:23:05,332][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:23:05,338][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:23:05,340][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:23:07,449][__main__][INFO] - Iteration 59 took 58s (8.84% Gen, 87.52% Train). Generation: 5s, Training: 50s. Estimated remaining time: 15h 10m 13s. Estimated total time: 16h 8m 47s. Time estimates for 10 more iterations: 9m 41s, 100 more iterations: 1h 36m 52s, 500 more iterations: 8h 4m 23s. [2026-03-25 15:23:07,452][__main__][INFO] - Starting iteration 59. [2026-03-25 15:23:07,457][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:23:07,457][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:23:12,567][__main__][INFO] - Number of regex retries in iteration 59: 0 [2026-03-25 15:23:12,568][__main__][INFO] - agents played in iteration 59 are Bob, Alice [2026-03-25 15:23:13,149][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:23:13,218][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:23:13,219][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:23:13,220][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:23:13,906][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:23:14,555][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:23:15,271][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:23:15,989][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:23:16,706][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:23:17,423][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:23:18,140][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:23:18,856][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:23:19,572][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:23:20,289][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:23:21,007][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:23:21,722][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:23:22,440][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:23:23,157][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:23:23,872][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:23:24,589][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:23:25,305][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:23:26,021][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:23:26,737][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:23:27,454][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:23:28,169][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:23:28,889][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:23:29,605][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:23:30,322][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:23:31,039][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:23:31,755][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:23:32,473][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:23:33,190][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:23:33,908][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:23:34,624][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:23:35,342][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:23:36,057][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:23:36,778][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:23:37,495][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:23:38,213][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:23:38,931][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:23:39,649][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:23:40,366][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:23:41,083][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:23:41,800][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:23:42,518][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:23:43,235][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:23:43,954][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:23:44,670][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:23:45,388][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:23:46,106][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:23:46,823][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:23:47,541][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:23:48,500][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:23:49,218][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:23:49,936][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:23:50,654][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:23:51,372][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:23:52,090][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:23:52,809][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:23:53,527][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:23:54,244][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:23:54,964][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:23:55,680][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:23:56,399][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:23:57,117][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:23:57,835][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:23:58,554][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:23:59,271][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:23:59,989][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:24:00,781][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 15:24:01,877][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:24:01,882][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:24:01,884][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:24:03,785][__main__][INFO] - Iteration 60 took 56s (9.07% Gen, 87.55% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 39m 20s. Estimated total time: 15h 38m 50s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 53s, 500 more iterations: 7h 49m 25s. [2026-03-25 15:24:03,790][__main__][INFO] - Starting iteration 60. [2026-03-25 15:24:03,794][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:24:03,795][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:24:08,913][__main__][INFO] - Number of regex retries in iteration 60: 0 [2026-03-25 15:24:08,914][__main__][INFO] - agents played in iteration 60 are Bob, Alice [2026-03-25 15:24:09,384][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:24:09,453][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:24:09,453][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:24:09,454][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:24:10,142][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:24:10,790][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:24:11,509][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:24:12,225][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:24:12,942][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:24:13,658][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:24:14,375][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:24:15,092][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:24:15,810][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:24:16,526][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:24:17,244][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:24:17,960][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:24:18,675][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:24:19,391][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:24:20,107][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:24:20,823][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:24:21,538][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:24:22,255][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:24:22,971][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:24:23,687][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:24:24,403][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:24:25,121][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:24:25,838][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:24:26,558][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:24:27,276][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:24:27,995][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:24:28,711][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:24:29,429][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:24:30,147][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:24:30,866][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:24:31,584][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:24:32,301][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:24:33,018][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:24:33,737][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:24:34,453][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:24:35,171][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:24:35,889][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:24:36,607][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:24:37,326][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:24:38,044][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:24:38,762][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:24:39,482][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:24:40,199][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:24:40,918][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:24:41,636][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:24:42,353][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:24:43,072][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:24:43,790][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:24:44,777][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:24:45,499][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:24:46,215][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:24:46,934][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:24:47,654][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:24:48,370][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:24:49,091][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:24:49,809][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:24:50,525][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:24:51,245][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:24:51,964][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:24:52,681][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:24:53,400][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:24:54,118][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:24:54,837][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:24:55,554][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:24:56,274][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:24:57,001][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 15:24:58,150][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:24:58,154][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:24:58,155][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:24:59,410][__main__][INFO] - Iteration 61 took 55s (9.20% Gen, 88.53% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 26m 33s. Estimated total time: 15h 26m 58s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 41s, 500 more iterations: 7h 43m 29s. [2026-03-25 15:24:59,412][__main__][INFO] - Starting iteration 61. [2026-03-25 15:24:59,416][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:24:59,417][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:25:04,498][__main__][INFO] - Number of regex retries in iteration 61: 0 [2026-03-25 15:25:04,499][__main__][INFO] - agents played in iteration 61 are Bob, Alice [2026-03-25 15:25:04,966][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:25:05,032][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:25:05,033][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:25:05,034][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:25:05,758][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:25:06,406][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:25:07,124][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:25:07,839][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:25:08,557][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:25:09,274][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:25:09,992][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:25:10,708][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:25:11,427][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:25:12,144][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:25:12,862][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:25:13,578][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:25:14,295][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:25:15,012][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:25:15,729][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:25:16,448][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:25:17,165][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:25:17,885][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:25:18,603][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:25:19,321][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:25:20,040][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:25:20,757][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:25:21,474][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:25:22,190][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:25:22,907][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:25:23,623][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:25:24,341][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:25:25,057][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:25:25,774][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:25:26,489][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:25:27,207][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:25:27,923][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:25:28,641][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:25:29,357][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:25:30,076][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:25:30,794][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:25:31,511][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:25:32,230][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:25:32,947][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:25:33,666][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:25:34,387][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:25:35,107][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:25:35,825][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:25:36,543][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:25:37,259][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:25:37,977][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:25:38,698][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:25:39,418][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:25:40,381][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:25:41,101][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:25:41,817][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:25:42,537][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:25:43,255][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:25:43,969][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:25:44,690][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:25:45,406][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:25:46,124][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:25:46,842][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:25:47,562][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:25:48,282][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:25:49,000][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:25:49,717][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:25:50,435][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:25:51,154][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:25:51,872][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:25:52,621][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 15:25:53,846][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:25:53,850][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:25:53,852][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:25:55,321][__main__][INFO] - Iteration 62 took 55s (9.09% Gen, 88.28% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 30m 25s. Estimated total time: 15h 31m 46s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 10s, 500 more iterations: 7h 45m 53s. [2026-03-25 15:25:55,323][__main__][INFO] - Starting iteration 62. [2026-03-25 15:25:55,327][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:25:55,327][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:26:00,482][__main__][INFO] - Number of regex retries in iteration 62: 0 [2026-03-25 15:26:00,483][__main__][INFO] - agents played in iteration 62 are Bob, Alice [2026-03-25 15:26:00,975][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:26:01,043][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:26:01,044][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:26:01,044][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:26:01,748][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:26:02,394][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:26:03,111][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:26:03,826][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:26:04,541][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:26:05,257][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:26:05,973][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:26:06,688][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:26:07,404][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:26:08,120][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:26:08,839][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:26:09,555][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:26:10,273][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:26:10,989][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:26:11,705][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:26:12,422][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:26:13,137][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:26:13,853][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:26:14,570][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:26:15,286][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:26:16,003][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:26:16,718][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:26:17,435][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:26:18,151][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:26:18,867][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:26:19,582][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:26:20,298][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:26:21,012][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:26:21,730][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:26:22,443][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:26:23,162][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:26:23,876][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:26:24,595][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:26:25,309][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:26:26,027][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:26:26,741][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:26:27,460][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:26:28,175][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:26:28,893][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:26:29,608][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:26:30,324][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:26:31,040][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:26:31,755][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:26:32,473][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:26:33,188][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:26:33,906][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:26:34,621][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:26:35,339][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:26:36,348][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:26:37,064][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:26:37,783][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:26:38,501][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:26:39,221][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:26:39,940][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:26:40,659][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:26:41,376][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:26:42,093][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:26:42,810][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:26:43,528][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:26:44,245][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:26:44,962][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:26:45,678][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:26:46,396][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:26:47,112][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:26:47,831][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:26:48,593][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 15:26:49,790][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:26:49,794][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:26:49,796][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:26:51,071][__main__][INFO] - Iteration 63 took 55s (9.25% Gen, 88.46% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 26m 49s. Estimated total time: 15h 29m 6s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 54s, 500 more iterations: 7h 44m 33s. [2026-03-25 15:26:51,074][__main__][INFO] - Starting iteration 63. [2026-03-25 15:26:51,079][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:26:51,080][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:26:56,261][__main__][INFO] - Number of regex retries in iteration 63: 0 [2026-03-25 15:26:56,262][__main__][INFO] - agents played in iteration 63 are Bob, Alice [2026-03-25 15:26:56,732][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:26:56,801][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:26:56,802][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:26:56,802][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:26:57,485][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:26:58,131][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:26:58,848][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:26:59,563][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:27:00,279][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:27:00,995][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:27:01,711][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:27:02,426][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:27:03,143][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:27:03,857][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:27:04,574][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:27:05,288][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:27:06,004][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:27:06,720][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:27:07,437][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:27:08,152][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:27:08,870][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:27:09,586][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:27:10,302][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:27:11,019][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:27:11,736][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:27:12,452][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:27:13,169][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:27:13,885][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:27:14,602][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:27:15,317][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:27:16,035][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:27:16,751][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:27:17,467][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:27:18,185][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:27:18,899][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:27:19,615][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:27:20,330][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:27:21,045][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:27:21,760][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:27:22,479][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:27:23,196][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:27:23,913][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:27:24,629][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:27:25,345][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:27:26,060][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:27:26,778][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:27:27,492][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:27:28,209][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:27:28,923][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:27:29,640][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:27:30,355][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:27:31,072][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:27:32,016][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:27:32,731][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:27:33,448][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:27:34,163][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:27:34,880][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:27:35,596][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:27:36,311][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:27:37,027][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:27:37,743][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:27:38,460][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:27:39,178][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:27:39,894][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:27:40,611][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:27:41,327][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:27:42,043][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:27:42,761][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:27:43,477][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:27:44,197][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 15:27:45,371][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:27:45,374][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:27:45,376][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:27:47,460][__main__][INFO] - Iteration 64 took 56s (9.19% Gen, 87.11% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 36m 30s. Estimated total time: 15h 39m 43s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 58s, 500 more iterations: 7h 49m 51s. [2026-03-25 15:27:47,463][__main__][INFO] - Starting iteration 64. [2026-03-25 15:27:47,467][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:27:47,468][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:27:52,547][__main__][INFO] - Number of regex retries in iteration 64: 0 [2026-03-25 15:27:52,548][__main__][INFO] - agents played in iteration 64 are Bob, Alice [2026-03-25 15:27:53,016][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:27:53,085][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:27:53,086][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:27:53,087][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:27:53,779][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:27:54,426][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:27:55,142][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:27:55,854][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:27:56,570][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:27:57,285][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:27:58,000][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:27:58,715][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:27:59,430][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:28:00,145][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:28:00,859][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:28:01,576][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:28:02,290][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:28:03,007][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:28:03,722][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:28:04,439][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:28:05,154][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:28:05,870][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:28:06,586][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:28:07,304][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:28:08,019][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:28:08,738][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:28:09,454][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:28:10,171][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:28:10,886][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:28:11,603][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:28:12,320][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:28:13,037][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:28:13,753][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:28:14,469][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:28:15,186][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:28:15,902][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:28:16,620][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:28:17,336][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:28:18,053][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:28:18,768][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:28:19,485][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:28:20,201][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:28:20,917][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:28:21,632][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:28:22,350][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:28:23,067][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:28:23,784][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:28:24,499][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:28:25,216][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:28:25,931][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:28:26,649][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:28:27,366][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:28:28,305][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:28:29,022][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:28:29,738][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:28:30,455][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:28:31,173][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:28:31,891][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:28:32,607][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:28:33,324][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:28:34,041][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:28:34,758][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:28:35,476][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:28:36,194][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:28:36,912][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:28:37,902][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:28:38,620][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:28:39,337][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:28:40,054][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:28:40,777][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 15:28:41,934][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:28:41,938][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:28:41,940][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:28:43,306][__main__][INFO] - Iteration 65 took 55s (9.10% Gen, 88.45% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 26m 32s. Estimated total time: 15h 30m 41s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 4s, 500 more iterations: 7h 45m 20s. [2026-03-25 15:28:43,309][__main__][INFO] - Starting iteration 65. [2026-03-25 15:28:43,313][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:28:43,314][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:28:48,458][__main__][INFO] - Number of regex retries in iteration 65: 0 [2026-03-25 15:28:48,459][__main__][INFO] - agents played in iteration 65 are Bob, Alice [2026-03-25 15:28:49,015][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:28:49,086][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:28:49,087][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:28:49,087][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:28:49,785][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:28:50,431][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:28:51,150][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:28:51,866][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:28:52,582][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:28:53,297][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:28:54,015][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:28:54,730][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:28:55,447][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:28:56,162][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:28:56,878][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:28:57,595][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:28:58,311][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:28:59,029][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:28:59,743][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:29:00,459][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:29:01,174][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:29:01,889][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:29:02,607][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:29:03,322][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:29:04,039][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:29:04,753][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:29:05,469][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:29:06,183][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:29:06,899][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:29:07,614][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:29:08,329][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:29:09,046][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:29:09,763][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:29:10,479][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:29:11,195][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:29:11,911][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:29:12,627][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:29:13,343][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:29:14,060][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:29:14,775][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:29:15,492][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:29:16,207][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:29:16,925][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:29:17,640][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:29:18,358][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:29:19,074][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:29:19,791][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:29:20,508][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:29:21,226][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:29:21,943][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:29:22,659][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:29:23,377][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:29:24,399][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:29:25,118][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:29:25,833][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:29:26,550][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:29:27,266][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:29:27,987][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:29:28,704][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:29:29,421][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:29:30,137][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:29:30,853][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:29:31,569][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:29:32,287][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:29:33,005][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:29:33,723][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:29:34,440][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:29:35,157][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:29:35,875][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:29:36,601][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 15:29:37,898][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:29:37,901][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:29:37,902][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:29:39,155][__main__][INFO] - Iteration 66 took 55s (9.21% Gen, 88.54% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 25m 39s. Estimated total time: 15h 30m 44s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 4s, 500 more iterations: 7h 45m 22s. [2026-03-25 15:29:39,158][__main__][INFO] - Starting iteration 66. [2026-03-25 15:29:39,162][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:29:39,163][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:29:44,356][__main__][INFO] - Number of regex retries in iteration 66: 0 [2026-03-25 15:29:44,357][__main__][INFO] - agents played in iteration 66 are Bob, Alice [2026-03-25 15:29:44,864][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:29:44,931][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:29:44,932][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:29:44,933][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:29:45,621][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:29:46,267][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:29:46,984][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:29:47,700][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:29:48,415][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:29:49,132][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:29:49,847][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:29:50,563][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:29:51,278][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:29:51,994][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:29:52,710][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:29:53,426][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:29:54,143][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:29:54,857][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:29:55,573][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:29:56,287][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:29:57,003][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:29:57,718][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:29:58,433][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:29:59,150][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:29:59,865][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:30:00,582][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:30:01,298][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:30:02,013][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:30:02,729][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:30:03,446][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:30:04,162][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:30:04,876][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:30:05,592][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:30:06,313][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:30:07,030][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:30:07,746][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:30:08,462][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:30:09,181][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:30:09,896][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:30:10,615][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:30:11,332][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:30:12,049][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:30:12,764][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:30:13,483][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:30:14,198][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:30:14,915][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:30:15,630][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:30:16,349][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:30:17,065][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:30:17,781][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:30:18,498][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:30:19,214][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:30:20,261][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:30:20,981][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:30:21,697][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:30:22,415][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:30:23,131][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:30:23,847][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:30:24,564][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:30:25,280][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:30:25,996][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:30:26,714][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:30:27,432][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:30:28,149][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:30:28,867][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:30:29,583][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:30:30,300][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:30:31,092][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:30:31,811][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:30:32,572][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 15:30:33,800][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:30:33,806][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:30:33,808][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:30:35,239][__main__][INFO] - Iteration 67 took 56s (9.26% Gen, 88.18% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 28m 38s. Estimated total time: 15h 34m 39s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 27s, 500 more iterations: 7h 47m 19s. [2026-03-25 15:30:35,245][__main__][INFO] - Starting iteration 67. [2026-03-25 15:30:35,284][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:30:35,285][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:30:40,807][__main__][INFO] - Number of regex retries in iteration 67: 0 [2026-03-25 15:30:40,808][__main__][INFO] - agents played in iteration 67 are Bob, Alice [2026-03-25 15:30:41,289][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:30:41,357][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:30:41,358][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:30:41,359][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:30:42,047][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:30:42,694][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:30:43,412][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:30:44,127][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:30:44,844][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:30:45,560][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:30:46,276][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:30:46,991][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:30:47,708][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:30:48,423][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:30:49,141][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:30:49,860][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:30:50,575][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:30:51,290][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:30:52,005][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:30:52,721][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:30:53,437][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:30:54,153][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:30:54,868][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:30:55,585][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:30:56,300][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:30:57,017][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:30:57,735][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:30:58,452][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:30:59,168][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:30:59,885][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:31:00,600][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:31:01,317][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:31:02,033][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:31:02,751][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:31:03,467][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:31:04,184][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:31:04,899][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:31:05,616][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:31:06,332][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:31:07,050][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:31:07,766][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:31:08,484][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:31:09,200][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:31:09,917][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:31:10,633][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:31:11,349][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:31:12,067][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:31:12,783][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:31:13,501][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:31:14,217][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:31:14,935][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:31:15,653][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:31:16,612][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:31:17,330][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:31:18,047][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:31:18,763][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:31:19,482][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:31:20,197][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:31:20,915][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:31:21,632][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:31:22,351][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:31:23,068][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:31:23,784][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:31:24,502][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:31:25,219][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:31:25,937][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:31:26,653][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:31:27,374][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:31:28,092][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:31:28,868][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 15:31:30,069][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:31:30,073][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:31:30,075][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:31:31,860][__main__][INFO] - Iteration 68 took 56s (9.76% Gen, 87.08% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 35m 59s. Estimated total time: 15h 42m 57s. Time estimates for 10 more iterations: 9m 25s, 100 more iterations: 1h 34m 17s, 500 more iterations: 7h 51m 28s. [2026-03-25 15:31:31,863][__main__][INFO] - Starting iteration 68. [2026-03-25 15:31:31,905][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:31:31,906][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:31:37,068][__main__][INFO] - Number of regex retries in iteration 68: 0 [2026-03-25 15:31:37,070][__main__][INFO] - agents played in iteration 68 are Bob, Alice [2026-03-25 15:31:37,535][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:31:37,603][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:31:37,605][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:31:37,605][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:31:38,284][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:31:38,931][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:31:39,648][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:31:40,364][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:31:41,078][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:31:41,793][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:31:42,509][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:31:43,225][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:31:43,943][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:31:44,659][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:31:45,378][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:31:46,095][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:31:46,810][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:31:47,529][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:31:48,247][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:31:48,964][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:31:49,680][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:31:50,397][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:31:51,113][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:31:51,830][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:31:52,550][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:31:53,267][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:31:53,986][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:31:54,704][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:31:55,421][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:31:56,139][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:31:56,856][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:31:57,576][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:31:58,294][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:31:59,011][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:31:59,730][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:32:00,448][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:32:01,167][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:32:01,885][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:32:02,601][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:32:03,319][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:32:04,037][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:32:04,754][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:32:05,472][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:32:06,190][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:32:06,907][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:32:07,625][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:32:08,343][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:32:09,062][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:32:09,779][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:32:10,499][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:32:11,216][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:32:11,934][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:32:12,912][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:32:13,631][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:32:14,347][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:32:15,065][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:32:15,782][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:32:16,500][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:32:17,217][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:32:17,936][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:32:18,652][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:32:19,371][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:32:20,088][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:32:20,807][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:32:21,523][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:32:22,242][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:32:22,960][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:32:23,678][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:32:24,396][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:32:25,111][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 15:32:26,265][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:32:26,268][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:32:26,270][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:32:27,618][__main__][INFO] - Iteration 69 took 55s (9.27% Gen, 88.31% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 20m 42s. Estimated total time: 15h 28m 35s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 51s, 500 more iterations: 7h 44m 17s. [2026-03-25 15:32:27,621][__main__][INFO] - Starting iteration 69. [2026-03-25 15:32:27,624][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:32:27,625][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:32:32,740][__main__][INFO] - Number of regex retries in iteration 69: 0 [2026-03-25 15:32:32,741][__main__][INFO] - agents played in iteration 69 are Bob, Alice [2026-03-25 15:32:33,264][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:32:33,332][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:32:33,333][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:32:33,334][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:32:34,026][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:32:34,738][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:32:35,462][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:32:36,177][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:32:36,894][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:32:37,609][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:32:38,327][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:32:39,043][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:32:39,761][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:32:40,476][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:32:41,195][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:32:41,910][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:32:42,630][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:32:43,360][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:32:44,111][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:32:44,828][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:32:45,544][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:32:46,262][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:32:46,980][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:32:47,697][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:32:48,416][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:32:49,133][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:32:49,851][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:32:50,568][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:32:51,285][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:32:52,004][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:32:52,720][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:32:53,438][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:32:54,155][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:32:54,874][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:32:55,592][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:32:56,309][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:32:57,028][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:32:57,748][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:32:58,465][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:32:59,184][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:32:59,902][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:33:00,620][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:33:01,339][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:33:02,057][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:33:02,776][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:33:03,495][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:33:04,212][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:33:04,930][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:33:05,648][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:33:06,368][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:33:07,086][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:33:07,805][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:33:08,758][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:33:09,478][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:33:10,196][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:33:10,915][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:33:11,634][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:33:12,352][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:33:13,070][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:33:13,789][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:33:14,506][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:33:15,227][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:33:15,944][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:33:16,664][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:33:17,383][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:33:18,102][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:33:18,823][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:33:19,541][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:33:20,260][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:33:21,039][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 15:33:22,013][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:33:22,015][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:33:22,017][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:33:23,510][__main__][INFO] - Iteration 70 took 55s (9.25% Gen, 88.08% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 22m 38s. Estimated total time: 15h 31m 27s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 8s, 500 more iterations: 7h 45m 43s. [2026-03-25 15:33:23,520][__main__][INFO] - Starting iteration 70. [2026-03-25 15:33:23,533][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:33:23,533][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:33:28,761][__main__][INFO] - Number of regex retries in iteration 70: 0 [2026-03-25 15:33:28,762][__main__][INFO] - agents played in iteration 70 are Bob, Alice [2026-03-25 15:33:29,248][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:33:29,321][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:33:29,321][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:33:29,322][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:33:30,043][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:33:30,737][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:33:31,454][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:33:32,172][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:33:32,889][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:33:33,605][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:33:34,322][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:33:35,039][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:33:35,758][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:33:36,477][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:33:37,195][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:33:37,912][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:33:38,631][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:33:39,350][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:33:40,067][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:33:40,788][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:33:41,508][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:33:42,227][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:33:42,947][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:33:43,667][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:33:44,384][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:33:45,104][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:33:45,822][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:33:46,539][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:33:47,260][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:33:47,977][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:33:48,695][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:33:49,412][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:33:50,128][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:33:50,849][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:33:51,566][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:33:52,286][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:33:53,005][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:33:53,723][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:33:54,444][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:33:55,164][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:33:55,882][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:33:56,600][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:33:57,318][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:33:58,036][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:33:58,753][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:33:59,470][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:34:00,188][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:34:00,906][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:34:01,623][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:34:02,342][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:34:03,064][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:34:03,781][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:34:04,749][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:34:05,470][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:34:06,188][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:34:06,907][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:34:07,626][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:34:08,344][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:34:09,062][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:34:09,781][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:34:10,500][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:34:11,220][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:34:11,941][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:34:12,660][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:34:13,380][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:34:14,102][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:34:14,820][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:34:15,543][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:34:16,263][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:34:17,067][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 15:34:18,020][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:34:18,022][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:34:18,024][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:34:19,312][__main__][INFO] - Iteration 71 took 55s (9.37% Gen, 88.31% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 19m 56s. Estimated total time: 15h 29m 41s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 58s, 500 more iterations: 7h 44m 50s. [2026-03-25 15:34:19,315][__main__][INFO] - Starting iteration 71. [2026-03-25 15:34:19,319][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:34:19,319][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:34:24,482][__main__][INFO] - Number of regex retries in iteration 71: 0 [2026-03-25 15:34:24,483][__main__][INFO] - agents played in iteration 71 are Bob, Alice [2026-03-25 15:34:24,963][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:34:25,034][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:34:25,034][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:34:25,035][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:34:25,724][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:34:26,374][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:34:27,094][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:34:27,811][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:34:28,528][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:34:29,246][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:34:29,967][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:34:30,686][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:34:31,405][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:34:32,125][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:34:32,844][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:34:33,563][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:34:34,284][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:34:35,004][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:34:35,724][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:34:36,442][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:34:37,162][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:34:37,880][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:34:38,600][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:34:39,322][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:34:40,040][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:34:40,759][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:34:41,479][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:34:42,196][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:34:42,916][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:34:43,634][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:34:44,354][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:34:45,074][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:34:45,795][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:34:46,513][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:34:47,232][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:34:47,952][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:34:48,671][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:34:49,391][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:34:50,112][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:34:50,830][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:34:51,550][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:34:52,269][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:34:52,988][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:34:53,705][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:34:54,425][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:34:55,143][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:34:55,862][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:34:56,581][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:34:57,299][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:34:58,019][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:34:58,737][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:34:59,457][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:35:00,442][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:35:01,163][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:35:01,880][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:35:02,600][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:35:03,320][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:35:04,039][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:35:04,757][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:35:05,480][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:35:06,198][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:35:06,917][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:35:07,637][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:35:08,357][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:35:09,078][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:35:09,799][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:35:10,519][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:35:11,240][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:35:11,961][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:35:12,717][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 15:35:13,699][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:35:13,701][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:35:13,702][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:35:15,129][__main__][INFO] - Iteration 72 took 55s (9.25% Gen, 88.19% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 19m 31s. Estimated total time: 15h 30m 12s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 1s, 500 more iterations: 7h 45m 6s. [2026-03-25 15:35:15,132][__main__][INFO] - Starting iteration 72. [2026-03-25 15:35:15,136][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:35:15,136][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:35:20,328][__main__][INFO] - Number of regex retries in iteration 72: 0 [2026-03-25 15:35:20,329][__main__][INFO] - agents played in iteration 72 are Bob, Alice [2026-03-25 15:35:20,831][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:35:20,901][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:35:20,902][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:35:20,902][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:35:21,603][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:35:22,250][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:35:22,970][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:35:23,687][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:35:24,405][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:35:25,124][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:35:25,844][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:35:26,563][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:35:27,281][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:35:28,000][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:35:28,719][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:35:29,437][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:35:30,154][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:35:30,871][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:35:31,589][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:35:32,306][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:35:33,024][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:35:33,742][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:35:34,460][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:35:35,178][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:35:35,895][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:35:36,611][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:35:37,330][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:35:38,045][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:35:38,763][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:35:39,481][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:35:40,198][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:35:40,915][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:35:41,631][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:35:42,350][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:35:43,065][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:35:43,783][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:35:44,499][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:35:45,218][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:35:45,934][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:35:46,654][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:35:47,370][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:35:48,088][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:35:48,804][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:35:49,522][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:35:50,239][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:35:50,957][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:35:51,676][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:35:52,393][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:35:53,113][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:35:53,830][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:35:54,547][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:35:55,265][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:35:56,209][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:35:56,928][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:35:57,646][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:35:58,364][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:35:59,083][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:35:59,801][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:36:00,518][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:36:01,237][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:36:01,955][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:36:02,674][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:36:03,391][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:36:04,109][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:36:04,827][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:36:05,545][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:36:06,263][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:36:06,981][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:36:07,700][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:36:08,425][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 15:36:09,579][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:36:09,584][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:36:09,586][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:36:11,004][__main__][INFO] - Iteration 73 took 55s (9.29% Gen, 88.16% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 19m 33s. Estimated total time: 15h 31m 10s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 7s, 500 more iterations: 7h 45m 35s. [2026-03-25 15:36:11,008][__main__][INFO] - Starting iteration 73. [2026-03-25 15:36:11,015][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:36:11,016][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:36:16,114][__main__][INFO] - Number of regex retries in iteration 73: 0 [2026-03-25 15:36:16,115][__main__][INFO] - agents played in iteration 73 are Bob, Alice [2026-03-25 15:36:16,683][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:36:16,751][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:36:16,752][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:36:16,753][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:36:17,442][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:36:18,090][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:36:18,808][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:36:19,525][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:36:20,240][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:36:20,958][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:36:21,673][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:36:22,392][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:36:23,108][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:36:23,826][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:36:24,543][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:36:25,263][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:36:25,979][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:36:26,694][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:36:27,411][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:36:28,126][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:36:28,842][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:36:29,558][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:36:30,274][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:36:30,991][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:36:31,706][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:36:32,424][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:36:33,140][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:36:33,857][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:36:34,573][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:36:35,291][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:36:36,007][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:36:36,724][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:36:37,440][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:36:38,158][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:36:38,876][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:36:39,594][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:36:40,309][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:36:41,026][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:36:41,743][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:36:42,460][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:36:43,178][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:36:43,895][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:36:44,614][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:36:45,333][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:36:46,049][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:36:46,768][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:36:47,485][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:36:48,202][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:36:48,920][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:36:49,637][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:36:50,355][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:36:51,073][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:36:52,034][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:36:52,753][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:36:53,469][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:36:54,187][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:36:54,904][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:36:55,622][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:36:56,339][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:36:57,058][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:36:57,775][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:36:58,492][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:36:59,210][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:36:59,928][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:37:00,645][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:37:01,363][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:37:02,083][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:37:02,803][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:37:03,522][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:37:04,388][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 15:37:05,537][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:37:05,541][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:37:05,542][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:37:06,863][__main__][INFO] - Iteration 74 took 55s (9.13% Gen, 88.50% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 18m 17s. Estimated total time: 15h 30m 50s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 5s, 500 more iterations: 7h 45m 25s. [2026-03-25 15:37:06,866][__main__][INFO] - Starting iteration 74. [2026-03-25 15:37:06,870][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:37:06,870][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:37:12,058][__main__][INFO] - Number of regex retries in iteration 74: 0 [2026-03-25 15:37:12,060][__main__][INFO] - agents played in iteration 74 are Bob, Alice [2026-03-25 15:37:12,576][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:37:12,648][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:37:12,649][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:37:12,650][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:37:13,364][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:37:14,011][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:37:14,732][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:37:15,449][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:37:16,167][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:37:16,885][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:37:17,603][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:37:18,323][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:37:19,039][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:37:19,759][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:37:20,478][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:37:21,196][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:37:21,915][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:37:22,632][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:37:23,350][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:37:24,067][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:37:24,783][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:37:25,501][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:37:26,217][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:37:26,934][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:37:27,650][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:37:28,368][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:37:29,084][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:37:29,802][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:37:30,518][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:37:31,236][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:37:31,952][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:37:32,669][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:37:33,385][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:37:34,102][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:37:34,819][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:37:35,536][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:37:36,253][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:37:36,969][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:37:37,687][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:37:38,404][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:37:39,123][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:37:39,841][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:37:40,559][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:37:41,278][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:37:41,995][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:37:42,714][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:37:43,432][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:37:44,150][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:37:44,868][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:37:45,587][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:37:46,305][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:37:47,023][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:37:48,006][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:37:48,726][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:37:49,444][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:37:50,163][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:37:50,882][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:37:51,606][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:37:52,324][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:37:53,042][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:37:53,762][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:37:54,480][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:37:55,199][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:37:55,916][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:37:56,634][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:37:57,353][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:37:58,072][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:37:58,789][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:37:59,510][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:38:00,235][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 15:38:01,378][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:38:01,382][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:38:01,384][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:38:02,677][__main__][INFO] - Iteration 75 took 55s (9.30% Gen, 88.38% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 16m 40s. Estimated total time: 15h 30m 8s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 0s, 500 more iterations: 7h 45m 4s. [2026-03-25 15:38:02,679][__main__][INFO] - Starting iteration 75. [2026-03-25 15:38:02,683][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:38:02,684][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:38:07,961][__main__][INFO] - Number of regex retries in iteration 75: 0 [2026-03-25 15:38:07,962][__main__][INFO] - agents played in iteration 75 are Bob, Alice [2026-03-25 15:38:08,437][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:38:08,507][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:38:08,508][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:38:08,509][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:38:09,205][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:38:09,852][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:38:10,571][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:38:11,288][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:38:12,007][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:38:12,724][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:38:13,442][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:38:14,161][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:38:14,879][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:38:15,598][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:38:16,314][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:38:17,033][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:38:17,750][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:38:18,466][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:38:19,184][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:38:19,902][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:38:20,620][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:38:21,338][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:38:22,054][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:38:22,774][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:38:23,492][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:38:24,210][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:38:24,930][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:38:25,649][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:38:26,368][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:38:27,086][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:38:27,805][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:38:28,524][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:38:29,241][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:38:29,959][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:38:30,678][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:38:31,395][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:38:32,115][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:38:32,832][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:38:33,549][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:38:34,270][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:38:34,989][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:38:35,706][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:38:36,425][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:38:37,143][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:38:37,863][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:38:38,580][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:38:39,299][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:38:40,019][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:38:40,738][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:38:41,456][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:38:42,177][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:38:42,896][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:38:43,849][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:38:44,570][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:38:45,289][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:38:46,010][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:38:46,730][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:38:47,451][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:38:48,172][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:38:48,891][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:38:49,611][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:38:50,331][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:38:51,050][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:38:51,771][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:38:52,489][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:38:53,209][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:38:53,929][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:38:54,646][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:38:55,367][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:38:56,093][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 15:38:57,056][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:38:57,058][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:38:57,060][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:38:58,351][__main__][INFO] - Iteration 76 took 55s (9.48% Gen, 88.19% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 13m 25s. Estimated total time: 15h 27m 50s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 47s, 500 more iterations: 7h 43m 55s. [2026-03-25 15:38:58,354][__main__][INFO] - Starting iteration 76. [2026-03-25 15:38:58,357][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:38:58,358][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:39:03,538][__main__][INFO] - Number of regex retries in iteration 76: 0 [2026-03-25 15:39:03,539][__main__][INFO] - agents played in iteration 76 are Bob, Alice [2026-03-25 15:39:04,018][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:39:04,089][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:39:04,090][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:39:04,091][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:39:04,786][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:39:05,435][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:39:06,156][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:39:06,874][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:39:07,593][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:39:08,310][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:39:09,028][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:39:09,745][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:39:10,463][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:39:11,181][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:39:11,900][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:39:12,621][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:39:13,340][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:39:14,058][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:39:14,778][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:39:15,497][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:39:16,216][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:39:16,936][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:39:17,656][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:39:18,376][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:39:19,096][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:39:19,814][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:39:20,532][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:39:21,251][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:39:21,971][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:39:22,690][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:39:23,411][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:39:24,129][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:39:24,849][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:39:25,569][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:39:26,289][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:39:27,011][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:39:27,729][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:39:28,449][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:39:29,170][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:39:29,889][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:39:30,610][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:39:31,329][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:39:32,048][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:39:32,769][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:39:33,487][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:39:34,206][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:39:34,928][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:39:35,648][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:39:36,369][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:39:37,090][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:39:37,810][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:39:38,530][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:39:39,501][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:39:40,223][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:39:40,943][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:39:41,666][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:39:42,387][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:39:43,108][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:39:43,830][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:39:44,552][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:39:45,275][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:39:45,995][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:39:46,716][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:39:47,439][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:39:48,161][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:39:48,881][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:39:49,602][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:39:50,324][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:39:51,046][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:39:51,856][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 15:39:52,935][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:39:52,938][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:39:52,939][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:39:54,374][__main__][INFO] - Iteration 77 took 56s (9.25% Gen, 88.19% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 18m 17s. Estimated total time: 15h 33m 37s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 21s, 500 more iterations: 7h 46m 48s. [2026-03-25 15:39:54,376][__main__][INFO] - Starting iteration 77. [2026-03-25 15:39:54,380][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:39:54,381][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:39:59,489][__main__][INFO] - Number of regex retries in iteration 77: 0 [2026-03-25 15:39:59,490][__main__][INFO] - agents played in iteration 77 are Bob, Alice [2026-03-25 15:39:59,980][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:40:00,052][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:40:00,053][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:40:00,053][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:40:00,776][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:40:01,427][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:40:02,147][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:40:02,867][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:40:03,585][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:40:04,304][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:40:05,023][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:40:05,742][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:40:06,460][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:40:07,177][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:40:07,898][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:40:08,615][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:40:09,335][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:40:10,053][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:40:10,773][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:40:11,493][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:40:12,214][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:40:12,935][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:40:13,652][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:40:14,373][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:40:15,092][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:40:15,810][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:40:16,531][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:40:17,250][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:40:17,970][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:40:18,691][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:40:19,410][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:40:20,129][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:40:20,851][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:40:21,570][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:40:22,288][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:40:23,009][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:40:23,727][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:40:24,446][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:40:25,165][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:40:25,884][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:40:26,603][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:40:27,323][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:40:28,041][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:40:28,761][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:40:29,480][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:40:30,198][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:40:30,917][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:40:31,636][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:40:32,354][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:40:33,075][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:40:33,794][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:40:34,513][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:40:35,491][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:40:36,212][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:40:36,930][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:40:37,650][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:40:38,368][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:40:39,090][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:40:39,808][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:40:40,528][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:40:41,249][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:40:41,970][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:40:42,688][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:40:43,409][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:40:44,129][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:40:44,849][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:40:45,569][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:40:46,290][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:40:47,009][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:40:47,747][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 15:40:48,764][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:40:48,767][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:40:48,768][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:40:50,114][__main__][INFO] - Iteration 78 took 55s (9.17% Gen, 88.42% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 12m 39s. Estimated total time: 15h 28m 55s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 53s, 500 more iterations: 7h 44m 27s. [2026-03-25 15:40:50,116][__main__][INFO] - Starting iteration 78. [2026-03-25 15:40:50,121][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:40:50,122][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:40:55,362][__main__][INFO] - Number of regex retries in iteration 78: 0 [2026-03-25 15:40:55,363][__main__][INFO] - agents played in iteration 78 are Bob, Alice [2026-03-25 15:40:55,842][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:40:55,912][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:40:55,912][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:40:55,913][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:40:56,596][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:40:57,242][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:40:57,961][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:40:58,680][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:40:59,395][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:41:00,114][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:41:00,829][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:41:01,547][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:41:02,263][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:41:02,982][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:41:03,699][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:41:04,417][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:41:05,134][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:41:05,852][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:41:06,568][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:41:07,285][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:41:08,001][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:41:08,721][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:41:09,438][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:41:10,157][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:41:10,877][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:41:11,594][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:41:12,314][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:41:13,029][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:41:13,747][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:41:14,465][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:41:15,183][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:41:15,899][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:41:16,618][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:41:17,334][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:41:18,052][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:41:18,770][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:41:19,488][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:41:20,205][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:41:20,922][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:41:21,640][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:41:22,356][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:41:23,074][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:41:23,792][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:41:24,509][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:41:25,228][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:41:25,945][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:41:26,663][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:41:27,381][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:41:28,098][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:41:28,818][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:41:29,535][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:41:30,258][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:41:31,197][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:41:31,915][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:41:32,633][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:41:33,351][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:41:34,069][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:41:34,787][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:41:35,505][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:41:36,224][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:41:36,942][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:41:37,660][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:41:38,379][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:41:39,099][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:41:39,817][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:41:40,534][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:41:41,253][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:41:41,973][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:41:42,691][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:41:43,423][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 15:41:44,382][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:41:44,384][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:41:44,386][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:41:45,699][__main__][INFO] - Iteration 79 took 55s (9.43% Gen, 88.20% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 9m 9s. Estimated total time: 15h 26m 20s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 38s, 500 more iterations: 7h 43m 10s. [2026-03-25 15:41:45,702][__main__][INFO] - Starting iteration 79. [2026-03-25 15:41:45,707][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:41:45,708][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:41:50,854][__main__][INFO] - Number of regex retries in iteration 79: 0 [2026-03-25 15:41:50,855][__main__][INFO] - agents played in iteration 79 are Bob, Alice [2026-03-25 15:41:51,357][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:41:51,425][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:41:51,426][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:41:51,427][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:41:52,105][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:41:52,751][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:41:53,470][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:41:54,188][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:41:54,905][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:41:55,620][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:41:56,336][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:41:57,052][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:41:57,768][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:41:58,485][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:41:59,201][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:41:59,919][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:42:00,634][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:42:01,352][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:42:02,069][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:42:02,787][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:42:03,503][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:42:04,221][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:42:04,938][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:42:05,654][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:42:06,373][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:42:07,089][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:42:07,808][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:42:08,524][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:42:09,242][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:42:09,958][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:42:10,675][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:42:11,391][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:42:12,110][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:42:12,826][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:42:13,545][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:42:14,261][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:42:14,979][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:42:15,700][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:42:16,417][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:42:17,138][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:42:17,857][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:42:18,577][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:42:19,299][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:42:20,017][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:42:20,739][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:42:21,459][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:42:22,179][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:42:22,902][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:42:23,620][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:42:24,339][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:42:25,057][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:42:25,778][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:42:26,738][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:42:27,458][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:42:28,178][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:42:28,897][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:42:29,615][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:42:30,334][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:42:31,053][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:42:31,773][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:42:32,494][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:42:33,215][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:42:33,938][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:42:34,661][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:42:35,384][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:42:36,105][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:42:36,828][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:42:37,550][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:42:38,273][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:42:39,081][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 15:42:40,072][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:42:40,077][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:42:40,078][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:42:41,344][__main__][INFO] - Iteration 80 took 55s (9.25% Gen, 88.47% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 9m 12s. Estimated total time: 15h 27m 19s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 43s, 500 more iterations: 7h 43m 39s. [2026-03-25 15:42:41,346][__main__][INFO] - Starting iteration 80. [2026-03-25 15:42:41,351][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:42:41,351][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:42:46,415][__main__][INFO] - Number of regex retries in iteration 80: 0 [2026-03-25 15:42:46,416][__main__][INFO] - agents played in iteration 80 are Bob, Alice [2026-03-25 15:42:46,989][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:42:47,057][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:42:47,057][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:42:47,058][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:42:47,808][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:42:48,456][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:42:49,175][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:42:49,892][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:42:50,609][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:42:51,326][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:42:52,042][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:42:52,759][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:42:53,474][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:42:54,191][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:42:54,906][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:42:55,624][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:42:56,340][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:42:57,057][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:42:57,773][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:42:58,490][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:42:59,206][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:42:59,923][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:43:00,639][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:43:01,355][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:43:02,072][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:43:02,788][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:43:03,506][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:43:04,223][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:43:04,942][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:43:05,657][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:43:06,375][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:43:07,091][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:43:07,809][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:43:08,526][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:43:09,245][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:43:09,964][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:43:10,680][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:43:11,398][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:43:12,115][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:43:12,833][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:43:13,551][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:43:14,268][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:43:14,987][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:43:15,704][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:43:16,422][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:43:17,142][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:43:17,860][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:43:18,579][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:43:19,296][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:43:20,016][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:43:20,732][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:43:21,453][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:43:22,431][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:43:23,149][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:43:23,867][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:43:24,585][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:43:25,303][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:43:26,020][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:43:26,738][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:43:27,456][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:43:28,175][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:43:28,892][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:43:29,613][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:43:30,330][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:43:31,050][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:43:31,769][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:43:32,488][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:43:33,205][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:43:33,924][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:43:34,642][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 15:43:36,031][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:43:36,036][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:43:36,038][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:43:37,379][__main__][INFO] - Iteration 81 took 56s (9.04% Gen, 88.56% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 14m 47s. Estimated total time: 15h 33m 50s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 23s, 500 more iterations: 7h 46m 55s. [2026-03-25 15:43:37,382][__main__][INFO] - Starting iteration 81. [2026-03-25 15:43:37,386][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:43:37,387][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:43:42,682][__main__][INFO] - Number of regex retries in iteration 81: 0 [2026-03-25 15:43:42,684][__main__][INFO] - agents played in iteration 81 are Bob, Alice [2026-03-25 15:43:43,165][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:43:43,233][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:43:43,233][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:43:43,234][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:43:43,921][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:43:44,567][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:43:45,287][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:43:46,003][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:43:46,719][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:43:47,435][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:43:48,152][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:43:48,867][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:43:49,585][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:43:50,300][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:43:51,017][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:43:51,733][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:43:52,450][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:43:53,167][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:43:53,884][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:43:54,602][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:43:55,317][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:43:56,035][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:43:56,752][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:43:57,469][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:43:58,186][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:43:58,903][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:43:59,621][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:44:00,338][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:44:01,056][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:44:01,773][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:44:02,491][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:44:03,207][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:44:03,927][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:44:04,643][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:44:05,360][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:44:06,078][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:44:06,794][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:44:07,513][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:44:08,229][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:44:08,949][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:44:09,666][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:44:10,383][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:44:11,101][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:44:11,819][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:44:12,538][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:44:13,256][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:44:13,973][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:44:14,692][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:44:15,408][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:44:16,127][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:44:16,845][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:44:17,563][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:44:18,507][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:44:19,225][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:44:19,943][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:44:20,661][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:44:21,380][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:44:22,099][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:44:22,815][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:44:23,535][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:44:24,253][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:44:24,971][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:44:25,691][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:44:26,409][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:44:27,129][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:44:27,849][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:44:28,568][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:44:29,287][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:44:30,008][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:44:30,735][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 15:44:31,839][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:44:31,843][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:44:31,845][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:44:33,187][__main__][INFO] - Iteration 82 took 55s (9.49% Gen, 88.10% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 10m 3s. Estimated total time: 15h 30m 2s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 0s, 500 more iterations: 7h 45m 1s. [2026-03-25 15:44:33,191][__main__][INFO] - Starting iteration 82. [2026-03-25 15:44:33,195][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:44:33,196][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:44:38,394][__main__][INFO] - Number of regex retries in iteration 82: 0 [2026-03-25 15:44:38,395][__main__][INFO] - agents played in iteration 82 are Bob, Alice [2026-03-25 15:44:38,863][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:44:38,931][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:44:38,931][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:44:38,932][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:44:39,619][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:44:40,267][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:44:40,986][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:44:41,703][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:44:42,419][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:44:43,135][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:44:43,851][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:44:44,567][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:44:45,284][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:44:46,001][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:44:46,719][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:44:47,435][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:44:48,153][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:44:48,869][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:44:49,586][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:44:50,303][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:44:51,021][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:44:51,738][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:44:52,453][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:44:53,172][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:44:53,890][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:44:54,608][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:44:55,324][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:44:56,044][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:44:56,761][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:44:57,479][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:44:58,198][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:44:58,914][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:44:59,633][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:45:00,349][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:45:01,067][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:45:01,786][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:45:02,502][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:45:03,221][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:45:03,939][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:45:04,656][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:45:05,375][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:45:06,092][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:45:06,811][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:45:07,528][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:45:08,247][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:45:08,966][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:45:09,683][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:45:10,403][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:45:11,120][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:45:11,840][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:45:12,558][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:45:13,277][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:45:14,218][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:45:14,937][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:45:15,653][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:45:16,373][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:45:17,091][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:45:17,809][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:45:18,528][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:45:19,246][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:45:19,965][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:45:20,684][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:45:21,402][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:45:22,122][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:45:22,841][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:45:23,558][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:45:24,278][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:45:24,998][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:45:25,716][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:45:26,459][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 15:45:27,407][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:45:27,410][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:45:27,411][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:45:28,833][__main__][INFO] - Iteration 83 took 55s (9.34% Gen, 88.09% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 6m 26s. Estimated total time: 15h 27m 20s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 44s, 500 more iterations: 7h 43m 40s. [2026-03-25 15:45:28,836][__main__][INFO] - Starting iteration 83. [2026-03-25 15:45:28,841][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:45:28,842][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:45:37,735][__main__][INFO] - Number of regex retries in iteration 83: 0 [2026-03-25 15:45:37,736][__main__][INFO] - agents played in iteration 83 are Bob, Alice [2026-03-25 15:45:38,202][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:45:38,269][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:45:38,270][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:45:38,271][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:45:39,008][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:45:39,654][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:45:40,371][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:45:41,087][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:45:41,802][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:45:42,520][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:45:43,234][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:45:43,952][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:45:44,668][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:45:45,383][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:45:46,100][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:45:46,816][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:45:47,536][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:45:48,252][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:45:48,970][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:45:49,687][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:45:50,406][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:45:51,121][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:45:51,837][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:45:52,552][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:45:53,269][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:45:53,985][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:45:54,701][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:45:55,419][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:45:56,136][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:45:56,852][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:45:57,570][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:45:58,286][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:45:59,004][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:45:59,721][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:46:00,438][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:46:01,154][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:46:01,872][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:46:02,589][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:46:03,306][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:46:04,022][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:46:04,739][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:46:05,456][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:46:06,173][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:46:06,892][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:46:07,609][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:46:08,327][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:46:09,045][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:46:09,763][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:46:10,482][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:46:11,200][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:46:11,917][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:46:12,634][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:46:13,614][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:46:14,333][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:46:15,050][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:46:15,769][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:46:16,486][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:46:17,204][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:46:17,922][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:46:18,640][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:46:19,359][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:46:20,077][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:46:20,794][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:46:21,512][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:46:22,230][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:46:22,949][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:46:23,666][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:46:24,386][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:46:25,105][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:46:25,837][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 15:46:26,772][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:46:26,775][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:46:26,776][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:46:28,424][__main__][INFO] - Iteration 84 took 59s (14.93% Gen, 82.30% Train). Generation: 8s, Training: 49s. Estimated remaining time: 15h 11m 11s. Estimated total time: 16h 33m 5s. Time estimates for 10 more iterations: 9m 55s, 100 more iterations: 1h 39m 18s, 500 more iterations: 8h 16m 32s. [2026-03-25 15:46:28,428][__main__][INFO] - Starting iteration 84. [2026-03-25 15:46:28,433][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:46:28,434][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:46:33,939][__main__][INFO] - Number of regex retries in iteration 84: 0 [2026-03-25 15:46:33,940][__main__][INFO] - agents played in iteration 84 are Bob, Alice [2026-03-25 15:46:34,406][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:46:34,474][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:46:34,475][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:46:34,475][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:46:35,168][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:46:35,817][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:46:36,535][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:46:37,251][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:46:37,968][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:46:38,684][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:46:39,402][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:46:40,118][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:46:40,835][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:46:41,552][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:46:42,270][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:46:42,986][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:46:43,703][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:46:44,419][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:46:45,135][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:46:45,851][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:46:46,568][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:46:47,285][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:46:48,002][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:46:48,719][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:46:49,436][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:46:50,153][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:46:50,869][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:46:51,585][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:46:52,302][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:46:53,020][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:46:53,737][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:46:54,456][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:46:55,173][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:46:55,891][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:46:56,608][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:46:57,325][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:46:58,042][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:46:58,760][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:46:59,476][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:47:00,194][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:47:00,912][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:47:01,633][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:47:02,349][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:47:03,067][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:47:03,783][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:47:04,501][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:47:05,219][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:47:05,937][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:47:06,655][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:47:07,372][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:47:08,092][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:47:08,811][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:47:09,754][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:47:10,473][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:47:11,191][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:47:11,909][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:47:12,628][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:47:13,345][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:47:14,063][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:47:14,781][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:47:15,498][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:47:16,217][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:47:16,934][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:47:17,654][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:47:18,372][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:47:19,090][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:47:19,809][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:47:20,527][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:47:21,247][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:47:21,983][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 15:47:23,004][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:47:23,007][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:47:23,009][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:47:24,433][__main__][INFO] - Iteration 85 took 56s (9.83% Gen, 87.62% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 10m 32s. Estimated total time: 15h 33m 22s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 20s, 500 more iterations: 7h 46m 41s. [2026-03-25 15:47:24,435][__main__][INFO] - Starting iteration 85. [2026-03-25 15:47:24,439][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:47:24,440][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:47:29,584][__main__][INFO] - Number of regex retries in iteration 85: 0 [2026-03-25 15:47:29,585][__main__][INFO] - agents played in iteration 85 are Bob, Alice [2026-03-25 15:47:30,053][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:47:30,120][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:47:30,121][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:47:30,122][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:47:30,797][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:47:31,444][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:47:32,164][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:47:32,880][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:47:33,597][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:47:34,313][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:47:35,031][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:47:35,747][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:47:36,463][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:47:37,181][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:47:37,897][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:47:38,617][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:47:39,335][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:47:40,053][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:47:40,769][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:47:41,486][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:47:42,201][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:47:42,919][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:47:43,635][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:47:44,352][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:47:45,070][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:47:45,787][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:47:46,505][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:47:47,222][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:47:47,939][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:47:48,656][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:47:49,375][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:47:50,091][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:47:50,809][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:47:51,526][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:47:52,244][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:47:52,962][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:47:53,681][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:47:54,397][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:47:55,115][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:47:55,832][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:47:56,551][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:47:57,267][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:47:57,986][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:47:58,706][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:47:59,424][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:48:00,142][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:48:00,860][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:48:01,578][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:48:02,297][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:48:03,014][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:48:03,734][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:48:04,454][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:48:05,407][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:48:06,127][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:48:06,846][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:48:07,565][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:48:08,285][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:48:09,006][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:48:10,301][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:48:11,022][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:48:11,742][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:48:12,459][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:48:13,179][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:48:13,899][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:48:14,617][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:48:15,337][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:48:16,058][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:48:16,778][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:48:17,498][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:48:18,259][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 15:48:19,263][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:48:19,267][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:48:19,268][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:48:20,598][__main__][INFO] - Iteration 86 took 56s (9.16% Gen, 88.47% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 12m 14s. Estimated total time: 15h 36m 0s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 36s, 500 more iterations: 7h 48m 0s. [2026-03-25 15:48:20,601][__main__][INFO] - Starting iteration 86. [2026-03-25 15:48:20,604][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:48:20,605][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:48:25,805][__main__][INFO] - Number of regex retries in iteration 86: 0 [2026-03-25 15:48:25,806][__main__][INFO] - agents played in iteration 86 are Bob, Alice [2026-03-25 15:48:26,288][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:48:26,362][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:48:26,363][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:48:26,363][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:48:27,129][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:48:27,778][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:48:28,499][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:48:29,217][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:48:29,936][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:48:30,653][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:48:31,373][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:48:32,087][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:48:32,806][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:48:33,524][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:48:34,241][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:48:34,958][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:48:35,675][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:48:36,392][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:48:37,108][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:48:37,827][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:48:38,542][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:48:39,260][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:48:39,977][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:48:40,693][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:48:41,411][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:48:42,127][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:48:42,844][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:48:43,560][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:48:44,279][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:48:44,995][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:48:45,715][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:48:46,431][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:48:47,150][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:48:47,867][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:48:48,584][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:48:49,301][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:48:50,018][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:48:50,737][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:48:51,454][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:48:52,173][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:48:52,890][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:48:53,610][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:48:54,327][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:48:55,045][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:48:55,766][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:48:56,484][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:48:57,207][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:48:57,925][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:48:58,646][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:48:59,367][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:49:00,086][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:49:00,806][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:49:01,797][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:49:02,518][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:49:03,237][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:49:03,959][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:49:04,681][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:49:05,401][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:49:06,121][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:49:06,844][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:49:07,564][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:49:08,285][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:49:09,006][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:49:09,726][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:49:10,445][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:49:11,164][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:49:11,882][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:49:12,601][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:49:13,320][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:49:14,043][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 15:49:15,071][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:49:15,075][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:49:15,076][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:49:16,426][__main__][INFO] - Iteration 87 took 55s (9.32% Gen, 88.26% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 5m 41s. Estimated total time: 15h 30m 23s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 2s, 500 more iterations: 7h 45m 11s. [2026-03-25 15:49:16,429][__main__][INFO] - Starting iteration 87. [2026-03-25 15:49:16,433][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:49:16,434][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:49:21,631][__main__][INFO] - Number of regex retries in iteration 87: 0 [2026-03-25 15:49:21,632][__main__][INFO] - agents played in iteration 87 are Bob, Alice [2026-03-25 15:49:22,175][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:49:22,242][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:49:22,243][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:49:22,243][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:49:22,925][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:49:23,572][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:49:24,297][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:49:25,016][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:49:25,734][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:49:26,454][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:49:27,170][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:49:27,888][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:49:28,606][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:49:29,323][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:49:30,041][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:49:30,758][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:49:31,475][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:49:32,192][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:49:32,909][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:49:33,626][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:49:34,343][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:49:35,061][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:49:35,778][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:49:36,497][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:49:37,214][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:49:37,931][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:49:38,649][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:49:39,367][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:49:40,086][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:49:40,804][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:49:41,520][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:49:42,238][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:49:42,955][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:49:43,673][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:49:44,391][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:49:45,110][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:49:45,827][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:49:46,547][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:49:47,266][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:49:47,984][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:49:48,703][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:49:49,422][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:49:50,140][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:49:50,859][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:49:51,578][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:49:52,296][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:49:53,014][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:49:53,732][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:49:54,451][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:49:55,170][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:49:55,889][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:49:56,608][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:49:57,554][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:49:58,273][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:49:58,991][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:49:59,712][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:50:00,429][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:50:01,148][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:50:01,869][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:50:02,587][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:50:03,308][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:50:04,027][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:50:04,745][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:50:05,465][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:50:06,183][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:50:06,903][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:50:07,622][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:50:08,342][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:50:09,061][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:50:09,789][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 15:50:10,911][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:50:10,915][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:50:10,917][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:50:12,232][__main__][INFO] - Iteration 88 took 55s (9.32% Gen, 88.32% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 4m 23s. Estimated total time: 15h 30m 1s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 0s, 500 more iterations: 7h 45m 0s. [2026-03-25 15:50:12,235][__main__][INFO] - Starting iteration 88. [2026-03-25 15:50:12,240][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:50:12,241][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:50:17,450][__main__][INFO] - Number of regex retries in iteration 88: 0 [2026-03-25 15:50:17,451][__main__][INFO] - agents played in iteration 88 are Bob, Alice [2026-03-25 15:50:17,991][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:50:18,059][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:50:18,060][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:50:18,061][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:50:18,745][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:50:19,390][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:50:20,111][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:50:20,828][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:50:21,546][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:50:22,261][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:50:22,979][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:50:23,696][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:50:24,413][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:50:25,130][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:50:25,848][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:50:26,563][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:50:27,282][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:50:27,999][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:50:28,716][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:50:29,433][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:50:30,150][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:50:30,867][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:50:31,585][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:50:32,304][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:50:33,021][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:50:33,739][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:50:34,457][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:50:35,174][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:50:35,893][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:50:36,609][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:50:37,328][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:50:38,046][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:50:38,765][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:50:39,483][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:50:40,204][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:50:40,925][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:50:41,645][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:50:42,363][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:50:43,083][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:50:43,803][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:50:44,522][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:50:45,242][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:50:45,961][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:50:46,680][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:50:47,398][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:50:48,119][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:50:48,836][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:50:49,554][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:50:50,273][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:50:50,990][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:50:51,710][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:50:52,429][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:50:53,384][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:50:54,102][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:50:54,821][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:50:55,540][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:50:56,259][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:50:56,978][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:50:57,697][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:50:58,416][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:50:59,134][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:50:59,853][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:51:00,572][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:51:01,292][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:51:02,010][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:51:02,729][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:51:03,448][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:51:04,165][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:51:04,886][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:51:05,621][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 15:51:06,579][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:51:06,581][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:51:06,583][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:51:08,027][__main__][INFO] - Iteration 89 took 55s (9.34% Gen, 88.07% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 3m 15s. Estimated total time: 15h 29m 49s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 58s, 500 more iterations: 7h 44m 54s. [2026-03-25 15:51:08,032][__main__][INFO] - Starting iteration 89. [2026-03-25 15:51:08,036][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:51:08,037][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:51:13,154][__main__][INFO] - Number of regex retries in iteration 89: 0 [2026-03-25 15:51:13,155][__main__][INFO] - agents played in iteration 89 are Bob, Alice [2026-03-25 15:51:13,624][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:51:13,692][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:51:13,693][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:51:13,693][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:51:14,435][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:51:15,084][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:51:15,802][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:51:16,521][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:51:17,237][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:51:17,956][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:51:18,671][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:51:19,391][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:51:20,107][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:51:20,824][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:51:21,541][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:51:22,257][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:51:22,975][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:51:23,692][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:51:24,410][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:51:25,127][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:51:25,846][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:51:26,563][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:51:27,282][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:51:28,000][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:51:28,717][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:51:29,435][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:51:30,153][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:51:30,871][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:51:31,589][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:51:32,309][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:51:33,027][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:51:33,744][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:51:34,464][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:51:35,181][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:51:35,900][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:51:36,618][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:51:37,336][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:51:38,054][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:51:38,773][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:51:39,492][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:51:40,210][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:51:40,928][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:51:41,647][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:51:42,365][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:51:43,085][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:51:43,803][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:51:44,521][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:51:45,241][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:51:45,958][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:51:46,678][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:51:47,398][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:51:48,116][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:51:49,097][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:51:49,816][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:51:50,534][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:51:51,253][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:51:51,971][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:51:52,690][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:51:53,410][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:51:54,127][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:51:54,848][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:51:55,565][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:51:56,284][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:51:57,003][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:51:57,722][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:51:58,440][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:51:59,160][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:51:59,878][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:52:00,597][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:52:01,320][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 15:52:02,351][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:52:02,355][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:52:02,356][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:52:03,750][__main__][INFO] - Iteration 90 took 55s (9.19% Gen, 88.31% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 1m 6s. Estimated total time: 15h 28m 35s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 51s, 500 more iterations: 7h 44m 17s. [2026-03-25 15:52:03,752][__main__][INFO] - Starting iteration 90. [2026-03-25 15:52:03,757][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:52:03,758][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:52:08,838][__main__][INFO] - Number of regex retries in iteration 90: 0 [2026-03-25 15:52:08,839][__main__][INFO] - agents played in iteration 90 are Bob, Alice [2026-03-25 15:52:09,311][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:52:09,379][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:52:09,380][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:52:09,381][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:52:10,085][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:52:10,733][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:52:11,453][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:52:12,170][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:52:12,888][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:52:13,606][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:52:14,323][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:52:15,039][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:52:15,755][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:52:16,472][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:52:17,189][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:52:17,905][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:52:18,624][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:52:19,340][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:52:20,058][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:52:20,774][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:52:21,492][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:52:22,208][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:52:22,926][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:52:23,643][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:52:24,361][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:52:25,078][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:52:25,795][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:52:26,517][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:52:27,235][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:52:27,962][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:52:28,680][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:52:29,398][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:52:30,117][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:52:30,835][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:52:31,554][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:52:32,271][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:52:32,988][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:52:33,707][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:52:34,425][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:52:35,143][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:52:35,862][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:52:36,579][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:52:37,299][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:52:38,018][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:52:38,736][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:52:39,456][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:52:40,173][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:52:40,893][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:52:41,611][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:52:42,332][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:52:43,052][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:52:43,772][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:52:44,716][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:52:45,436][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:52:46,155][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:52:46,874][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:52:47,592][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:52:48,311][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:52:49,030][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:52:49,749][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:52:50,469][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:52:51,188][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:52:51,906][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:52:52,627][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:52:53,345][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:52:54,064][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:52:54,783][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:52:55,501][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:52:56,221][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:52:56,947][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 15:52:57,952][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:52:57,955][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:52:57,957][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:52:59,400][__main__][INFO] - Iteration 91 took 55s (9.13% Gen, 88.27% Train). Generation: 5s, Training: 49s. Estimated remaining time: 13h 59m 0s. Estimated total time: 15h 27m 25s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 44s, 500 more iterations: 7h 43m 42s. [2026-03-25 15:52:59,404][__main__][INFO] - Starting iteration 91. [2026-03-25 15:52:59,408][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:52:59,408][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:53:04,515][__main__][INFO] - Number of regex retries in iteration 91: 0 [2026-03-25 15:53:04,517][__main__][INFO] - agents played in iteration 91 are Bob, Alice [2026-03-25 15:53:04,986][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:53:05,054][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:53:05,055][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:53:05,056][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:53:05,743][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:53:06,390][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:53:07,109][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:53:07,827][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:53:08,543][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:53:09,261][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:53:09,979][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:53:10,694][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:53:11,412][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:53:12,130][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:53:12,846][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:53:13,563][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:53:14,282][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:53:14,998][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:53:15,717][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:53:16,434][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:53:17,151][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:53:17,870][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:53:18,587][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:53:19,309][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:53:20,026][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:53:20,744][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:53:21,460][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:53:22,178][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:53:22,895][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:53:23,612][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:53:24,331][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:53:25,049][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:53:25,769][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:53:26,487][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:53:27,207][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:53:27,924][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:53:28,642][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:53:29,361][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:53:30,080][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:53:30,799][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:53:31,518][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:53:32,236][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:53:32,954][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:53:33,672][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:53:34,391][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:53:35,109][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:53:35,828][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:53:36,546][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:53:37,266][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:53:37,985][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:53:38,705][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:53:39,424][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:53:40,370][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:53:41,092][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:53:41,813][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:53:42,534][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:53:43,255][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:53:43,978][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:53:44,700][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:53:45,422][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:53:46,144][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:53:46,866][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:53:47,588][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:53:48,307][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:53:49,029][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:53:49,751][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:53:50,472][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:53:51,193][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:53:51,914][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:53:52,655][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 15:53:53,603][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:53:53,608][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:53:53,609][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:53:55,492][__main__][INFO] - Iteration 92 took 56s (9.11% Gen, 87.53% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 5m 25s. Estimated total time: 15h 34m 46s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 28s, 500 more iterations: 7h 47m 23s. [2026-03-25 15:53:55,495][__main__][INFO] - Starting iteration 92. [2026-03-25 15:53:55,499][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:53:55,500][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:54:03,128][__main__][INFO] - Number of regex retries in iteration 92: 0 [2026-03-25 15:54:03,130][__main__][INFO] - agents played in iteration 92 are Bob, Alice [2026-03-25 15:54:03,637][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:54:03,710][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:54:03,711][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:54:03,712][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:54:04,481][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:54:05,130][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:54:05,852][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:54:06,570][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:54:07,288][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:54:08,006][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:54:08,723][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:54:09,442][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:54:10,158][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:54:10,875][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:54:11,592][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:54:12,308][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:54:13,027][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:54:13,743][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:54:14,461][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:54:15,181][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:54:15,898][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:54:16,616][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:54:17,333][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:54:18,053][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:54:18,771][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:54:19,489][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:54:20,207][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:54:20,924][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:54:21,642][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:54:22,359][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:54:23,077][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:54:23,793][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:54:24,512][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:54:25,228][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:54:25,945][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:54:26,663][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:54:27,381][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:54:28,099][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:54:28,816][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:54:29,535][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:54:30,254][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:54:30,972][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:54:31,689][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:54:32,407][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:54:33,125][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:54:33,843][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:54:34,560][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:54:35,279][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:54:35,996][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:54:36,714][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:54:37,432][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:54:38,150][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:54:39,139][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:54:39,859][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:54:40,575][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:54:41,296][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:54:42,014][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:54:42,732][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:54:43,450][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:54:44,169][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:54:44,887][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:54:45,606][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:54:46,324][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:54:47,044][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:54:47,765][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:54:48,483][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:54:49,201][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:54:49,919][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:54:50,638][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:54:51,355][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 15:54:52,401][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:54:52,404][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:54:52,406][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:54:53,962][__main__][INFO] - Iteration 93 took 58s (13.05% Gen, 84.28% Train). Generation: 7s, Training: 49s. Estimated remaining time: 14h 44m 5s. Estimated total time: 16h 14m 24s. Time estimates for 10 more iterations: 9m 44s, 100 more iterations: 1h 37m 26s, 500 more iterations: 8h 7m 12s. [2026-03-25 15:54:53,965][__main__][INFO] - Starting iteration 93. [2026-03-25 15:54:53,971][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:54:53,972][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:55:00,203][__main__][INFO] - Number of regex retries in iteration 93: 0 [2026-03-25 15:55:00,204][__main__][INFO] - agents played in iteration 93 are Bob, Alice [2026-03-25 15:55:00,707][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:55:00,773][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:55:00,774][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:55:00,775][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:55:01,455][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:55:02,103][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:55:02,821][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:55:03,538][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:55:04,254][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:55:04,971][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:55:05,687][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:55:06,405][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:55:07,121][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:55:07,837][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:55:08,555][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:55:09,272][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:55:09,991][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:55:10,707][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:55:11,425][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:55:12,141][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:55:12,860][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:55:13,577][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:55:14,294][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:55:15,012][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:55:15,729][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:55:16,446][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:55:17,163][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:55:17,880][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:55:18,596][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:55:19,315][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:55:20,032][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:55:20,748][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:55:21,465][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:55:22,183][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:55:22,900][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:55:23,618][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:55:24,337][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:55:25,052][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:55:25,772][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:55:26,489][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:55:27,206][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:55:27,924][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:55:28,643][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:55:29,361][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:55:30,078][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:55:30,797][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:55:31,514][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:55:32,231][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:55:32,950][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:55:33,668][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:55:34,386][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:55:35,103][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:55:36,054][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:55:36,772][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:55:37,490][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:55:38,209][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:55:38,929][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:55:39,646][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:55:40,365][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:55:41,082][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:55:41,800][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:55:42,519][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:55:43,236][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:55:43,957][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:55:44,674][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:55:45,392][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:55:46,111][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:55:46,829][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:55:47,548][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:55:48,268][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 15:55:49,310][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:55:49,314][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:55:49,315][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:55:50,779][__main__][INFO] - Iteration 94 took 56s (10.97% Gen, 86.44% Train). Generation: 6s, Training: 49s. Estimated remaining time: 14h 15m 36s. Estimated total time: 15h 46m 52s. Time estimates for 10 more iterations: 9m 28s, 100 more iterations: 1h 34m 41s, 500 more iterations: 7h 53m 26s. [2026-03-25 15:55:50,782][__main__][INFO] - Starting iteration 94. [2026-03-25 15:55:50,786][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:55:50,787][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:55:55,887][__main__][INFO] - Number of regex retries in iteration 94: 0 [2026-03-25 15:55:55,889][__main__][INFO] - agents played in iteration 94 are Bob, Alice [2026-03-25 15:55:56,428][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:55:56,496][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:55:56,497][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:55:56,498][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:55:57,189][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:55:57,836][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:55:58,554][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:55:59,271][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:55:59,988][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:56:00,704][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:56:01,422][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:56:02,138][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:56:02,860][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:56:03,576][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:56:04,296][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:56:05,014][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:56:05,731][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:56:06,448][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:56:07,165][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:56:07,883][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:56:08,599][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:56:09,319][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:56:10,036][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:56:10,753][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:56:11,471][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:56:12,186][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:56:12,905][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:56:13,623][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:56:14,340][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:56:15,057][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:56:15,775][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:56:16,492][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:56:17,210][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:56:17,928][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:56:18,645][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:56:19,363][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:56:20,081][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:56:20,798][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:56:21,516][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:56:22,234][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:56:22,953][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:56:23,671][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:56:24,390][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:56:25,108][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:56:25,824][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:56:26,545][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:56:27,262][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:56:27,982][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:56:28,700][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:56:29,418][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:56:30,139][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:56:30,856][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:56:31,802][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:56:32,521][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:56:33,244][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:56:33,963][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:56:34,681][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:56:35,401][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:56:36,119][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:56:36,839][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:56:37,558][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:56:38,277][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:56:38,998][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:56:39,717][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:56:40,435][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:56:41,156][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:56:41,874][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:56:42,593][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:56:43,313][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:56:44,034][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 15:56:45,153][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:56:45,157][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:56:45,158][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:56:46,535][__main__][INFO] - Iteration 95 took 55s (9.15% Gen, 88.38% Train). Generation: 5s, Training: 49s. Estimated remaining time: 13h 56m 58s. Estimated total time: 15h 29m 10s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 55s, 500 more iterations: 7h 44m 35s. [2026-03-25 15:56:46,540][__main__][INFO] - Starting iteration 95. [2026-03-25 15:56:46,547][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:56:46,549][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:56:51,632][__main__][INFO] - Number of regex retries in iteration 95: 0 [2026-03-25 15:56:51,633][__main__][INFO] - agents played in iteration 95 are Bob, Alice [2026-03-25 15:56:52,145][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:56:52,212][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:56:52,213][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:56:52,213][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:56:52,932][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:56:53,580][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:56:54,300][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:56:55,017][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:56:55,736][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:56:56,453][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:56:57,171][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:56:57,889][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:56:58,607][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:56:59,326][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:57:00,044][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:57:00,762][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:57:01,479][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:57:02,197][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:57:02,914][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:57:03,632][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:57:04,350][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:57:05,068][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:57:05,785][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:57:06,502][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:57:07,221][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:57:07,938][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:57:08,659][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:57:09,377][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:57:10,095][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:57:10,814][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:57:11,531][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:57:12,250][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:57:12,969][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:57:13,688][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:57:14,407][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:57:15,125][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:57:15,843][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:57:16,563][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:57:17,281][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:57:18,000][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:57:18,720][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:57:19,440][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:57:20,159][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:57:20,878][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:57:21,597][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:57:22,316][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:57:23,036][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:57:23,755][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:57:24,474][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:57:25,194][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:57:25,912][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:57:26,632][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:57:27,591][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:57:28,311][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:57:29,030][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:57:29,751][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:57:30,470][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:57:31,191][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:57:31,911][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:57:32,628][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:57:33,349][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:57:34,069][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:57:34,788][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:57:35,507][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:57:36,227][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:57:36,946][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:57:37,668][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:57:38,387][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:57:39,108][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:57:39,900][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 15:57:41,013][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:57:41,016][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:57:41,018][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:57:42,299][__main__][INFO] - Iteration 96 took 55s (9.12% Gen, 88.57% Train). Generation: 5s, Training: 49s. Estimated remaining time: 13h 56m 7s. Estimated total time: 15h 29m 15s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 55s, 500 more iterations: 7h 44m 37s. [2026-03-25 15:57:42,303][__main__][INFO] - Starting iteration 96. [2026-03-25 15:57:42,309][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:57:42,309][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:57:47,371][__main__][INFO] - Number of regex retries in iteration 96: 0 [2026-03-25 15:57:47,372][__main__][INFO] - agents played in iteration 96 are Bob, Alice [2026-03-25 15:57:47,839][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:57:47,906][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:57:47,907][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:57:47,908][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:57:48,593][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:57:49,242][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:57:49,962][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:57:50,681][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:57:51,397][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:57:52,115][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:57:52,832][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:57:53,550][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:57:54,267][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:57:54,985][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:57:55,704][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:57:56,421][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:57:57,139][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:57:57,856][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:57:58,575][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:57:59,293][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:58:00,012][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:58:00,730][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:58:01,448][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:58:02,167][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:58:02,885][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:58:03,604][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:58:04,322][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:58:05,042][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:58:05,760][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:58:06,479][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:58:07,198][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:58:07,918][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:58:08,637][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:58:09,358][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:58:10,077][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:58:10,795][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:58:11,515][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:58:12,234][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:58:12,955][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:58:13,673][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:58:14,393][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:58:15,114][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:58:15,833][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:58:16,553][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:58:17,272][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:58:17,991][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:58:18,711][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:58:19,431][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:58:20,150][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:58:20,870][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:58:21,591][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:58:22,310][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:58:23,282][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:58:24,002][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:58:24,722][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:58:25,444][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:58:26,163][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:58:26,882][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:58:27,603][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:58:28,323][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:58:29,043][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:58:29,764][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:58:30,484][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:58:31,204][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:58:31,924][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:58:32,646][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:58:33,366][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:58:34,086][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:58:34,808][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:58:35,537][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 15:58:36,775][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:58:36,781][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:58:36,783][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:58:38,615][__main__][INFO] - Iteration 97 took 56s (8.99% Gen, 87.75% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 4m 24s. Estimated total time: 15h 38m 28s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 50s, 500 more iterations: 7h 49m 14s. [2026-03-25 15:58:38,618][__main__][INFO] - Starting iteration 97. [2026-03-25 15:58:38,623][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:58:38,623][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:58:43,833][__main__][INFO] - Number of regex retries in iteration 97: 0 [2026-03-25 15:58:43,835][__main__][INFO] - agents played in iteration 97 are Bob, Alice [2026-03-25 15:58:44,300][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:58:44,367][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:58:44,367][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:58:44,368][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:58:45,050][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:58:45,702][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:58:46,422][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:58:47,140][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:58:47,857][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:58:48,574][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:58:49,292][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:58:50,009][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:58:50,728][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:58:51,445][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:58:52,164][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:58:52,881][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:58:53,600][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:58:54,322][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:58:55,040][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:58:55,761][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:58:56,482][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:58:57,203][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:58:57,922][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:58:58,644][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:58:59,363][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:59:00,083][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:59:00,804][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:59:01,521][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:59:02,242][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:59:02,961][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:59:03,679][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:59:04,400][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:59:05,118][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:59:05,838][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:59:06,558][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:59:07,280][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 15:59:08,001][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:59:08,723][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:59:09,443][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:59:10,164][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:59:10,888][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:59:11,607][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:59:12,329][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:59:13,050][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:59:13,770][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:59:14,492][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:59:15,214][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:59:15,935][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:59:16,657][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:59:17,379][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:59:18,099][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:59:18,821][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:59:19,773][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:59:20,495][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:59:21,218][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:59:21,937][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
15:59:22,659][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:59:23,380][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:59:24,102][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:59:24,823][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:59:25,542][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:59:26,265][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:59:26,988][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:59:27,709][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:59:28,431][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:59:29,154][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:59:29,877][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:59:30,599][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:59:31,319][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 15:59:32,061][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 15:59:33,100][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:59:33,103][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:59:33,105][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:59:34,425][__main__][INFO] - Iteration 98 took 55s (9.34% Gen, 88.29% Train). Generation: 5s, Training: 49s. Estimated remaining time: 13h 55m 4s. Estimated total time: 15h 30m 4s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 0s, 500 more iterations: 7h 45m 2s. [2026-03-25 15:59:34,427][__main__][INFO] - Starting iteration 98. [2026-03-25 15:59:34,431][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 15:59:34,432][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:59:39,529][__main__][INFO] - Number of regex retries in iteration 98: 0 [2026-03-25 15:59:39,530][__main__][INFO] - agents played in iteration 98 are Bob, Alice [2026-03-25 15:59:39,995][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:59:40,064][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 15:59:40,065][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 15:59:40,066][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 15:59:40,748][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:59:41,397][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:59:42,119][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:59:42,835][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:59:43,552][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:59:44,270][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:59:44,988][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:59:45,707][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:59:46,424][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:59:47,143][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:59:47,861][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 15:59:48,579][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:59:49,298][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:59:50,015][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:59:50,735][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:59:51,453][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:59:52,172][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:59:52,891][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:59:53,609][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:59:54,328][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:59:55,046][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:59:55,764][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:59:56,485][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:59:57,203][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:59:57,923][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:59:58,643][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:59:59,362][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:00:00,081][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:00:00,801][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:00:01,521][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:00:02,242][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:00:02,961][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:00:03,682][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:00:04,401][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:00:05,122][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:00:05,843][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:00:06,563][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:00:07,283][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:00:08,002][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:00:08,726][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:00:09,447][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:00:10,169][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:00:10,892][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:00:11,613][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:00:12,334][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:00:13,057][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:00:13,777][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:00:14,500][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:00:15,471][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:00:16,193][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:00:16,915][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:00:17,637][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:00:18,358][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:00:19,080][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:00:19,801][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:00:20,522][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:00:21,244][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:00:21,967][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:00:22,687][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:00:23,409][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:00:24,131][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:00:24,853][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:00:25,575][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:00:26,297][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:00:27,019][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:00:27,816][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 16:00:28,808][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:00:28,811][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:00:28,812][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:00:30,321][__main__][INFO] - Iteration 99 took 55s (9.12% Gen, 88.17% Train). Generation: 5s, Training: 49s. Estimated remaining time: 13h 55m 35s. Estimated total time: 15h 31m 31s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 9s, 500 more iterations: 7h 45m 45s. [2026-03-25 16:00:30,324][__main__][INFO] - Starting iteration 99. [2026-03-25 16:00:30,329][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 16:00:30,330][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:00:35,652][__main__][INFO] - Number of regex retries in iteration 99: 0 [2026-03-25 16:00:35,653][__main__][INFO] - agents played in iteration 99 are Bob, Alice [2026-03-25 16:00:36,127][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:00:36,194][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:00:36,195][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:00:36,195][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:00:36,873][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:00:37,522][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:00:38,244][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:00:38,964][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:00:39,681][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:00:40,398][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:00:41,116][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:00:41,838][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:00:42,557][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:00:43,279][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:00:43,998][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:00:44,716][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:00:45,435][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:00:46,154][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:00:46,872][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:00:47,590][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:00:48,309][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:00:49,027][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:00:49,746][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:00:50,464][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:00:51,182][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:00:51,901][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:00:52,620][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:00:53,340][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:00:54,059][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:00:54,777][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:00:55,497][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:00:56,216][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:00:56,934][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:00:57,654][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:00:58,373][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:00:59,092][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:00:59,812][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:01:00,530][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:01:01,250][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:01:01,969][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:01:02,688][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:01:03,407][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:01:04,127][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:01:04,846][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:01:05,565][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:01:06,287][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:01:07,004][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:01:07,724][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:01:08,447][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:01:09,168][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:01:09,888][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:01:10,608][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:01:11,590][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:01:12,311][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:01:13,030][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:01:13,753][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:01:14,472][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:01:15,193][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:01:15,915][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:01:16,635][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:01:17,356][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:01:18,077][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:01:18,798][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:01:19,520][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:01:20,242][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:01:20,966][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:01:21,689][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:01:22,410][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:01:23,134][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:01:23,876][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 16:01:24,909][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:01:24,912][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:01:24,913][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:01:26,321][__main__][INFO] - Iteration 100 took 55s (9.51% Gen, 87.97% Train). Generation: 5s, Training: 49s. Estimated remaining time: 13h 56m 23s. Estimated total time: 15h 33m 15s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 19s, 500 more iterations: 7h 46m 37s. [2026-03-25 16:01:26,323][__main__][INFO] - Starting iteration 100. [2026-03-25 16:01:26,327][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2026-03-25 16:01:26,328][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:01:31,373][__main__][INFO] - Number of regex retries in iteration 100: 0 [2026-03-25 16:01:31,374][__main__][INFO] - agents played in iteration 100 are Bob, Alice [2026-03-25 16:01:31,835][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:01:31,902][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:01:31,903][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:01:31,904][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:01:32,627][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:01:33,276][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:01:34,001][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:01:34,722][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:01:35,441][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:01:36,162][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:01:36,882][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:01:37,601][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:01:38,322][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:01:39,044][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:01:39,766][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:01:40,485][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:01:41,206][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:01:41,927][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:01:42,648][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:01:43,367][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:01:44,088][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:01:44,807][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:01:45,526][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:01:46,246][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:01:46,965][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:01:47,686][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:01:48,406][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:01:49,126][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:01:49,846][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:01:50,565][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:01:51,285][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:01:52,006][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:01:52,725][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:01:53,446][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:01:54,167][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:01:54,887][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:01:55,606][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:01:56,326][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:01:57,047][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:01:57,767][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:01:58,486][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:01:59,207][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:01:59,926][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:02:00,647][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:02:01,369][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:02:02,089][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:02:02,810][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:02:03,531][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:02:04,253][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:02:04,974][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:02:05,695][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:02:06,416][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:02:07,368][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:02:08,090][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:02:08,813][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:02:09,535][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:02:10,256][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:02:10,977][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:02:11,699][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:02:12,421][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:02:13,141][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:02:13,862][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:02:14,584][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:02:15,305][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:02:16,026][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:02:16,749][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:02:17,470][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:02:18,191][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:02:18,912][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:02:19,654][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 16:02:20,676][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:02:20,679][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:02:20,680][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:02:23,776][__main__][INFO] - Iteration 101 took 57s (8.78% Gen, 85.82% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 19m 41s. Estimated total time: 15h 57m 30s. Time estimates for 10 more iterations: 9m 34s, 100 more iterations: 1h 35m 45s, 500 more iterations: 7h 58m 45s. [2026-03-25 16:02:23,779][__main__][INFO] - Starting iteration 101. [2026-03-25 16:02:23,783][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:02:23,783][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:02:28,945][__main__][INFO] - Number of regex retries in iteration 101: 0 [2026-03-25 16:02:28,946][__main__][INFO] - agents played in iteration 101 are Bob, Alice [2026-03-25 16:02:29,466][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:02:29,534][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:02:29,535][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:02:29,536][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:02:30,219][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:02:30,867][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:02:31,586][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:02:32,304][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:02:33,022][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:02:33,740][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:02:34,459][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:02:35,177][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:02:35,896][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:02:36,614][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:02:37,332][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:02:38,049][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:02:38,769][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:02:39,489][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:02:40,207][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:02:40,926][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:02:41,646][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:02:42,365][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:02:43,084][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:02:43,803][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:02:44,522][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:02:45,239][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:02:45,957][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:02:46,675][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:02:47,394][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:02:48,110][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:02:48,830][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:02:49,547][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:02:50,266][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:02:50,989][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:02:51,708][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:02:52,426][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:02:53,146][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:02:53,865][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:02:54,584][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:02:55,302][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:02:56,020][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:02:56,740][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:02:57,459][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:02:58,176][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:02:58,897][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:02:59,619][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:03:00,339][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:03:01,059][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:03:01,780][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:03:02,500][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:03:03,222][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:03:03,946][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:03:04,912][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:03:05,633][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:03:06,354][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:03:07,076][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:03:07,797][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:03:08,517][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:03:09,237][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:03:09,959][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:03:10,680][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:03:11,400][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:03:12,122][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:03:12,844][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:03:13,562][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:03:14,282][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:03:15,003][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:03:15,721][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:03:16,440][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:03:17,241][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 16:03:18,169][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:03:18,171][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:03:18,173][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:03:19,509][__main__][INFO] - Iteration 102 took 55s (9.26% Gen, 88.33% Train). Generation: 5s, Training: 49s. Estimated remaining time: 13h 50m 3s. Estimated total time: 15h 28m 48s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 52s, 500 more iterations: 7h 44m 24s. [2026-03-25 16:03:19,512][__main__][INFO] - Starting iteration 102. [2026-03-25 16:03:19,516][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:03:19,517][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:03:24,671][__main__][INFO] - Number of regex retries in iteration 102: 0 [2026-03-25 16:03:24,673][__main__][INFO] - agents played in iteration 102 are Bob, Alice [2026-03-25 16:03:25,203][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:03:25,270][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:03:25,270][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:03:25,271][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:03:25,956][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:03:26,605][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:03:27,325][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:03:28,046][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:03:28,766][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:03:29,485][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:03:30,206][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:03:30,926][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:03:31,645][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:03:32,365][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:03:33,085][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:03:33,805][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:03:34,525][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:03:35,246][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:03:35,968][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:03:36,687][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:03:37,409][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:03:38,128][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:03:38,849][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:03:39,570][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:03:40,291][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:03:41,010][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:03:41,728][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:03:42,448][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:03:43,167][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:03:43,886][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:03:44,606][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:03:45,325][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:03:46,045][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:03:46,764][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:03:47,483][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:03:48,203][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:03:48,921][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:03:49,642][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:03:50,358][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:03:51,079][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:03:51,797][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:03:52,516][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:03:53,236][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:03:53,954][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:03:54,673][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:03:55,393][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:03:56,111][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:03:56,829][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:03:57,550][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:03:58,269][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:03:58,988][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:03:59,708][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:04:00,693][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:04:01,412][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:04:02,132][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:04:02,851][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:04:03,570][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:04:04,289][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:04:05,009][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:04:05,727][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:04:06,448][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:04:07,167][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:04:07,886][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:04:08,607][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:04:09,327][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:04:10,046][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:04:10,766][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:04:11,486][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:04:12,205][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:04:12,937][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 16:04:13,938][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:04:13,942][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:04:13,944][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:04:15,339][__main__][INFO] - Iteration 103 took 55s (9.24% Gen, 88.26% Train). Generation: 5s, Training: 49s. Estimated remaining time: 13h 50m 43s. Estimated total time: 15h 30m 24s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 2s, 500 more iterations: 7h 45m 12s. [2026-03-25 16:04:15,342][__main__][INFO] - Starting iteration 103. [2026-03-25 16:04:15,346][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:04:15,347][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:04:20,414][__main__][INFO] - Number of regex retries in iteration 103: 0 [2026-03-25 16:04:20,415][__main__][INFO] - agents played in iteration 103 are Bob, Alice [2026-03-25 16:04:20,894][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:04:20,963][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:04:20,963][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:04:20,964][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:04:21,643][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:04:22,292][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:04:23,013][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:04:23,730][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:04:24,449][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:04:25,168][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:04:25,886][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:04:26,606][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:04:27,324][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:04:28,042][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:04:28,762][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:04:29,481][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:04:30,200][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:04:30,919][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:04:31,638][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:04:32,358][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:04:33,077][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:04:33,796][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:04:34,515][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:04:35,232][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:04:35,951][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:04:36,669][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:04:37,387][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:04:38,106][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:04:38,824][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:04:39,543][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:04:40,260][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:04:40,980][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:04:41,697][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:04:42,415][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:04:43,134][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:04:43,852][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:04:44,572][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:04:45,290][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:04:46,009][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:04:46,727][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:04:47,445][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:04:48,165][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:04:48,886][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:04:49,608][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:04:50,329][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:04:51,048][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:04:51,770][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:04:52,488][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:04:53,209][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:04:53,931][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:04:54,647][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:04:55,369][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:04:56,320][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:04:57,044][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:04:57,764][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:04:58,483][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:04:59,204][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:04:59,925][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:05:00,645][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:05:01,366][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:05:02,086][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:05:02,807][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:05:03,528][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:05:04,250][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:05:04,971][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:05:05,691][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:05:06,412][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:05:07,134][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:05:07,854][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:05:08,594][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 16:05:09,520][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:05:09,522][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:05:09,524][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:05:10,787][__main__][INFO] - Iteration 104 took 55s (9.14% Gen, 88.57% Train). Generation: 5s, Training: 49s. Estimated remaining time: 13h 43m 25s. Estimated total time: 15h 24m 2s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 24s, 500 more iterations: 7h 42m 1s. [2026-03-25 16:05:10,789][__main__][INFO] - Starting iteration 104. [2026-03-25 16:05:10,794][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:05:10,795][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:05:15,985][__main__][INFO] - Number of regex retries in iteration 104: 0 [2026-03-25 16:05:15,987][__main__][INFO] - agents played in iteration 104 are Bob, Alice [2026-03-25 16:05:16,455][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:05:16,523][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:05:16,524][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:05:16,524][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:05:17,201][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:05:17,851][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:05:18,571][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:05:19,290][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:05:20,008][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:05:20,727][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:05:21,446][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:05:22,164][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:05:22,885][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:05:23,604][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:05:24,322][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:05:25,040][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:05:25,758][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:05:26,475][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:05:27,194][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:05:27,913][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:05:28,631][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:05:29,351][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:05:30,068][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:05:30,788][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:05:31,507][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:05:32,226][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:05:32,946][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:05:33,665][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:05:34,383][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:05:35,103][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:05:35,822][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:05:36,540][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:05:37,260][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:05:37,978][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:05:38,697][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:05:39,419][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:05:40,137][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:05:40,857][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:05:41,578][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:05:42,296][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:05:43,016][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:05:43,735][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:05:44,454][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:05:45,174][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:05:45,893][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:05:46,612][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:05:47,332][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:05:48,052][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:05:48,776][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:05:49,498][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:05:50,218][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:05:50,940][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:05:51,913][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:05:52,636][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:05:53,358][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:05:54,080][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:05:54,802][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:05:55,524][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:05:56,247][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:05:56,967][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:05:57,690][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:05:58,411][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:05:59,133][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:05:59,855][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:06:00,577][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:06:01,298][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:06:02,021][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:06:02,743][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:06:03,463][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:06:04,265][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 16:06:05,201][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:06:05,203][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:06:05,204][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:06:06,460][__main__][INFO] - Iteration 105 took 55s (9.33% Gen, 88.41% Train). Generation: 5s, Training: 49s. Estimated remaining time: 13h 46m 15s. Estimated total time: 15h 27m 47s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 46s, 500 more iterations: 7h 43m 53s. [2026-03-25 16:06:06,463][__main__][INFO] - Starting iteration 105. [2026-03-25 16:06:06,467][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:06:06,468][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:06:11,803][__main__][INFO] - Number of regex retries in iteration 105: 0 [2026-03-25 16:06:11,804][__main__][INFO] - agents played in iteration 105 are Bob, Alice [2026-03-25 16:06:12,279][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:06:12,347][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:06:12,347][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:06:12,348][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:06:13,031][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:06:13,683][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:06:14,404][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:06:15,124][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:06:15,844][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:06:16,561][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:06:17,281][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:06:17,998][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:06:18,717][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:06:19,437][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:06:20,154][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:06:20,873][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:06:21,591][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:06:22,310][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:06:23,030][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:06:23,748][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:06:24,467][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:06:25,187][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:06:25,906][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:06:26,625][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:06:27,345][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:06:28,065][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:06:28,784][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:06:29,509][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:06:30,228][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:06:30,946][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:06:31,667][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:06:32,386][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:06:33,107][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:06:33,827][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:06:34,546][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:06:35,266][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:06:35,987][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:06:36,706][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:06:37,427][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:06:38,148][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:06:38,868][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:06:39,587][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:06:41,469][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:06:43,366][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:06:44,177][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:06:44,901][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:06:45,621][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:06:46,340][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:06:47,062][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:06:47,783][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:06:48,503][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:06:49,223][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:06:50,199][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:06:50,922][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:06:51,641][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:06:52,362][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:06:53,080][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:06:53,803][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:06:54,523][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:06:55,246][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:06:55,967][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:06:56,688][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:06:57,409][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:06:58,131][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:06:58,852][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:06:59,570][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:07:00,291][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:07:01,013][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:07:01,732][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:07:02,465][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:49
[2026-03-25 16:07:03,740][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt
[2026-03-25 16:07:03,745][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt
[2026-03-25 16:07:03,747][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl
[2026-03-25 16:07:05,526][__main__][INFO] - Iteration 106 took 59s (9.04% Gen, 87.95% Train). Generation: 5s, Training: 51s. Estimated remaining time: 14h 41m 50s. Estimated total time: 16h 24m 21s. Time estimates for 10 more iterations: 9m 50s, 100 more iterations: 1h 38m 26s, 500 more iterations: 8h 12m 10s.
[2026-03-25 16:07:05,531][__main__][INFO] - Starting iteration 106.
[2026-03-25 16:07:05,537][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1.
[2026-03-25 16:07:05,538][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:07:10,630][__main__][INFO] - Number of regex retries in iteration 106: 0 [2026-03-25 16:07:10,631][__main__][INFO] - agents played in iteration 106 are Bob, Alice [2026-03-25 16:07:11,096][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:07:11,164][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:07:11,164][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:07:11,165][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:07:11,844][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:07:12,493][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:07:13,214][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:07:13,934][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:07:14,653][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:07:15,371][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:07:16,092][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:07:16,810][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:07:17,529][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:07:18,250][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:07:18,969][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:07:19,688][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:07:20,406][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:07:21,123][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:07:21,843][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:07:22,560][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:07:23,279][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:07:23,997][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:07:24,716][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:07:25,436][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:07:26,154][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:07:26,872][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:07:27,592][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:07:28,310][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:07:29,031][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:07:29,750][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:07:30,468][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:07:31,188][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:07:31,909][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:07:32,627][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:07:33,347][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:07:34,071][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:07:34,791][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:07:35,512][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:07:36,234][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:07:36,956][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:07:37,677][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:07:38,398][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:07:39,121][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:07:39,843][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:07:40,562][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:07:41,284][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:07:42,005][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:07:42,727][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:07:43,447][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:07:44,166][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:07:44,888][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:07:45,608][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:07:46,559][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:07:47,279][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:07:48,000][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:07:48,722][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:07:49,443][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:07:50,166][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:07:50,889][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:07:51,611][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:07:52,333][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:07:53,056][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:07:53,777][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:07:54,499][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:07:55,222][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:07:55,945][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:07:56,666][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:07:57,388][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:07:58,111][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:07:58,854][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47
[2026-03-25 16:07:59,966][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt
[2026-03-25 16:07:59,969][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt
[2026-03-25 16:07:59,970][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl
[2026-03-25 16:08:04,276][__main__][INFO] - Iteration 107 took 58s (8.67% Gen, 83.99% Train). Generation: 5s, Training: 49s. Estimated remaining time: 14h 35m 32s. Estimated total time: 16h 19m 1s. Time estimates for 10 more iterations: 9m 47s, 100 more iterations: 1h 37m 54s, 500 more iterations: 8h 9m 30s.
[2026-03-25 16:08:04,279][__main__][INFO] - Starting iteration 107.
[2026-03-25 16:08:04,284][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1.
[2026-03-25 16:08:04,284][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:08:09,397][__main__][INFO] - Number of regex retries in iteration 107: 0 [2026-03-25 16:08:09,399][__main__][INFO] - agents played in iteration 107 are Bob, Alice [2026-03-25 16:08:09,862][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:08:09,928][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:08:09,929][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:08:09,929][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:08:10,654][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:08:11,303][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:08:12,023][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:08:12,741][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:08:13,461][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:08:14,178][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:08:14,898][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:08:15,616][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:08:16,335][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:08:17,054][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:08:17,773][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:08:18,494][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:08:19,212][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:08:19,932][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:08:20,652][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:08:21,371][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:08:22,091][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:08:22,812][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:08:23,532][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:08:24,255][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:08:24,976][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:08:25,698][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:08:26,417][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:08:27,136][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:08:27,858][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:08:28,578][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:08:29,297][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:08:30,017][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:08:30,737][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:08:31,457][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:08:32,177][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:08:32,898][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:08:33,619][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:08:34,339][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:08:35,060][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:08:35,780][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:08:36,500][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:08:37,221][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:08:37,942][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:08:38,664][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:08:39,385][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:08:40,105][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:08:40,825][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:08:41,548][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:08:42,268][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:08:42,988][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:08:43,709][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:08:44,431][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:08:45,399][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:08:46,122][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:08:46,844][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:08:47,565][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:08:48,286][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:08:49,007][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:08:49,729][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:08:50,451][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:08:51,174][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:08:51,896][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:08:52,617][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:08:53,339][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:08:54,060][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:08:54,783][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:08:55,503][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:08:56,226][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:08:56,947][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:08:57,728][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47
[2026-03-25 16:08:58,774][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt
[2026-03-25 16:08:58,777][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt
[2026-03-25 16:08:58,779][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl
[2026-03-25 16:09:00,252][__main__][INFO] - Iteration 108 took 55s (9.14% Gen, 88.22% Train). Generation: 5s, Training: 49s. Estimated remaining time: 13h 48m 24s. Estimated total time: 15h 32m 50s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 17s, 500 more iterations: 7h 46m 25s.
[2026-03-25 16:09:00,254][__main__][INFO] - Starting iteration 108.
[2026-03-25 16:09:00,258][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1.
[2026-03-25 16:09:00,259][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:09:05,591][__main__][INFO] - Number of regex retries in iteration 108: 0 [2026-03-25 16:09:05,592][__main__][INFO] - agents played in iteration 108 are Bob, Alice [2026-03-25 16:09:06,065][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:09:06,133][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:09:06,134][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:09:06,135][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:09:06,826][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:09:07,476][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:09:08,196][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:09:08,916][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:09:09,634][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:09:10,353][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:09:11,073][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:09:11,790][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:09:12,511][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:09:13,230][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:09:13,949][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:09:14,670][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:09:15,390][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:09:16,109][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:09:16,829][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:09:17,545][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:09:18,264][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:09:18,983][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:09:19,702][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:09:20,420][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:09:21,141][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:09:21,858][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:09:22,578][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:09:23,297][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:09:24,016][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:09:24,735][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:09:25,452][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:09:26,174][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:09:26,892][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:09:27,612][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:09:28,332][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:09:29,050][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:09:29,770][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:09:30,489][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:09:31,208][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:09:31,928][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:09:32,648][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:09:33,368][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:09:34,087][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:09:34,807][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:09:35,526][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:09:36,245][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:09:36,967][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:09:37,690][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:09:38,412][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:09:39,132][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:09:39,854][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:09:40,575][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:09:41,556][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:09:42,279][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:09:43,000][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:09:43,720][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:09:44,442][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:09:45,161][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:09:45,883][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:09:46,605][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:09:47,326][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:09:48,047][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:09:48,770][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:09:49,489][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:09:50,210][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:09:50,930][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:09:51,651][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:09:52,370][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:09:53,091][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:09:53,820][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 16:09:55,269][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:09:55,274][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:09:55,276][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:09:56,590][__main__][INFO] - Iteration 109 took 56s (9.47% Gen, 88.20% Train). Generation: 5s, Training: 49s. Estimated remaining time: 13h 53m 30s. Estimated total time: 15h 38m 52s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 53s, 500 more iterations: 7h 49m 26s. [2026-03-25 16:09:56,592][__main__][INFO] - Starting iteration 109. [2026-03-25 16:09:56,597][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:09:56,598][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:10:01,670][__main__][INFO] - Number of regex retries in iteration 109: 0 [2026-03-25 16:10:01,670][__main__][INFO] - agents played in iteration 109 are Bob, Alice [2026-03-25 16:10:02,221][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:10:02,288][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:10:02,288][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:10:02,289][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:10:03,015][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:10:03,663][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:10:04,384][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:10:05,103][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:10:05,823][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:10:06,541][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:10:07,260][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:10:07,980][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:10:08,699][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:10:09,418][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:10:10,139][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:10:10,857][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:10:11,577][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:10:12,295][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:10:13,012][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:10:13,731][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:10:14,449][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:10:15,167][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:10:15,886][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:10:16,606][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:10:17,322][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:10:18,043][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:10:18,762][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:10:19,480][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:10:20,199][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:10:20,917][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:10:21,636][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:10:22,355][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:10:23,075][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:10:23,794][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:10:24,513][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:10:25,231][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:10:25,951][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:10:26,671][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:10:27,390][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:10:28,111][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:10:28,829][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:10:29,549][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:10:30,268][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:10:30,986][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:10:31,708][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:10:32,425][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:10:33,147][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:10:33,867][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:10:34,586][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:10:35,305][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:10:36,026][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:10:36,746][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:10:37,693][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:10:38,414][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:10:39,134][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:10:39,853][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:10:40,578][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:10:41,299][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:10:42,022][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:10:42,744][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:10:43,464][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:10:44,185][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:10:44,907][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:10:45,628][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:10:46,348][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:10:47,072][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:10:47,794][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:10:48,515][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:10:49,237][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:10:49,970][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 16:10:51,024][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:10:51,028][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:10:51,029][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:10:52,412][__main__][INFO] - Iteration 110 took 55s (9.09% Gen, 88.43% Train). Generation: 5s, Training: 49s. Estimated remaining time: 13h 43m 58s. Estimated total time: 15h 30m 16s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 1s, 500 more iterations: 7h 45m 8s. [2026-03-25 16:10:52,414][__main__][INFO] - Starting iteration 110. [2026-03-25 16:10:52,420][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:10:52,420][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:10:57,502][__main__][INFO] - Number of regex retries in iteration 110: 0 [2026-03-25 16:10:57,503][__main__][INFO] - agents played in iteration 110 are Bob, Alice [2026-03-25 16:10:58,008][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:10:58,076][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:10:58,077][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:10:58,078][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:10:58,764][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:10:59,416][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:11:00,135][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:11:00,854][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:11:01,573][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:11:02,291][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:11:03,011][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:11:03,730][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:11:04,448][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:11:05,170][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:11:05,889][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:11:06,608][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:11:07,328][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:11:08,049][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:11:08,767][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:11:09,487][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:11:10,205][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:11:10,923][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:11:11,643][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:11:12,361][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:11:13,081][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:11:13,800][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:11:14,521][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:11:15,245][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:11:15,965][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:11:16,686][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:11:17,406][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:11:18,126][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:11:18,847][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:11:19,569][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:11:20,288][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:11:21,009][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:11:21,729][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:11:22,450][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:11:23,171][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:11:23,890][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:11:24,613][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:11:25,334][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:11:26,055][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:11:26,776][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:11:27,496][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:11:28,218][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:11:28,939][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:11:29,660][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:11:30,382][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:11:31,103][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:11:31,824][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:11:32,546][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:11:33,497][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:11:34,219][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:11:34,940][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:11:35,662][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:11:36,383][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:11:37,104][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:11:37,827][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:11:38,549][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:11:39,270][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:11:39,991][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:11:40,713][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:11:41,434][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:11:42,155][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:11:42,880][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:11:43,601][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:11:44,322][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:11:45,045][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:11:45,790][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 16:11:46,827][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:11:46,830][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:11:46,831][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:11:48,091][__main__][INFO] - Iteration 111 took 55s (9.13% Gen, 88.60% Train). Generation: 5s, Training: 49s. Estimated remaining time: 13h 40m 39s. Estimated total time: 15h 27m 53s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 47s, 500 more iterations: 7h 43m 56s. [2026-03-25 16:11:48,093][__main__][INFO] - Starting iteration 111. [2026-03-25 16:11:48,097][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:11:48,098][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:11:53,931][__main__][INFO] - Number of regex retries in iteration 111: 0 [2026-03-25 16:11:53,933][__main__][INFO] - agents played in iteration 111 are Bob, Alice [2026-03-25 16:11:54,412][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:11:54,480][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:11:54,480][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:11:54,481][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:11:55,213][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:11:55,861][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:11:56,581][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:11:57,298][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:11:58,017][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:11:58,735][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:11:59,452][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:12:00,171][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:12:00,890][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:12:01,608][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:12:02,328][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:12:03,045][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:12:03,764][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:12:04,483][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:12:05,202][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:12:05,922][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:12:06,642][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:12:07,360][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:12:08,080][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:12:08,800][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:12:09,518][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:12:10,239][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:12:10,958][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:12:11,677][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:12:12,397][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:12:13,117][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:12:13,836][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:12:14,555][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:12:15,276][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:12:15,996][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:12:16,716][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:12:17,436][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:12:18,156][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:12:18,875][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:12:19,597][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:12:20,317][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:12:21,036][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:12:21,756][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:12:22,481][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:12:23,204][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:12:23,923][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:12:24,646][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:12:25,370][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:12:26,091][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:12:26,812][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:12:27,534][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:12:28,256][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:12:28,978][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:12:29,960][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:12:30,684][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:12:31,406][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:12:32,129][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:12:32,850][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:12:33,572][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:12:34,295][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:12:35,018][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:12:35,739][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:12:36,462][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:12:37,185][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:12:37,906][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:12:38,630][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:12:39,351][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:12:40,072][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:12:40,793][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:12:41,515][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:12:42,256][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 16:12:43,443][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:12:43,448][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:12:43,450][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:12:44,815][__main__][INFO] - Iteration 112 took 56s (10.29% Gen, 87.30% Train). Generation: 5s, Training: 49s. Estimated remaining time: 13h 57m 9s. Estimated total time: 15h 45m 19s. Time estimates for 10 more iterations: 9m 27s, 100 more iterations: 1h 34m 31s, 500 more iterations: 7h 52m 39s. [2026-03-25 16:12:44,819][__main__][INFO] - Starting iteration 112. [2026-03-25 16:12:44,824][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:12:44,824][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:12:50,032][__main__][INFO] - Number of regex retries in iteration 112: 0 [2026-03-25 16:12:50,033][__main__][INFO] - agents played in iteration 112 are Bob, Alice [2026-03-25 16:12:50,498][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:12:50,564][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:12:50,565][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:12:50,566][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:12:51,251][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:12:51,900][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:12:52,621][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:12:53,338][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:12:54,057][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:12:54,774][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:12:55,492][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:12:56,212][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:12:56,928][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:12:57,648][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:12:58,367][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:12:59,085][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:12:59,805][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:13:00,523][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:13:01,243][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:13:01,962][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:13:02,681][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:13:03,400][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:13:04,120][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:13:04,838][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:13:05,558][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:13:06,278][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:13:06,996][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:13:07,715][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:13:08,436][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:13:09,155][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:13:09,875][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:13:10,595][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:13:11,315][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:13:12,035][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:13:12,756][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:13:13,475][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:13:14,196][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:13:14,915][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:13:15,636][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:13:16,356][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:13:17,079][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:13:17,802][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:13:18,523][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:13:19,243][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:13:19,965][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:13:20,690][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:13:21,411][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:13:22,132][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:13:22,854][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:13:23,575][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:13:24,298][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:13:25,019][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:13:25,972][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:13:26,695][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:13:27,416][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:13:28,137][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:13:28,861][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:13:29,583][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:13:30,304][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:13:31,026][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:13:31,748][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:13:32,470][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:13:33,192][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:13:33,913][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:13:34,637][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:13:35,360][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:13:36,080][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:13:36,803][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:13:37,527][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:13:38,282][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 16:13:39,341][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:13:39,344][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:13:39,346][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:13:40,801][__main__][INFO] - Iteration 113 took 55s (9.31% Gen, 88.09% Train). Generation: 5s, Training: 49s. Estimated remaining time: 13h 43m 52s. Estimated total time: 15h 32m 59s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 17s, 500 more iterations: 7h 46m 29s. [2026-03-25 16:13:40,803][__main__][INFO] - Starting iteration 113. [2026-03-25 16:13:40,807][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:13:40,808][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:13:46,058][__main__][INFO] - Number of regex retries in iteration 113: 0 [2026-03-25 16:13:46,060][__main__][INFO] - agents played in iteration 113 are Bob, Alice [2026-03-25 16:13:46,528][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:13:46,596][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:13:46,597][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:13:46,598][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:13:47,287][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:13:47,936][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:13:48,660][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:13:49,378][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:13:50,098][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:13:50,819][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:13:51,538][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:13:52,258][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:13:52,978][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:13:53,700][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:13:54,418][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:13:55,139][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:13:55,859][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:13:56,578][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:13:57,299][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:13:58,019][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:13:58,738][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:13:59,458][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:14:00,180][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:14:00,899][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:14:01,619][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:14:02,340][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:14:03,062][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:14:03,781][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:14:04,504][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:14:05,224][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:14:05,945][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:14:06,666][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:14:07,387][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:14:08,109][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:14:08,831][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:14:09,553][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:14:10,274][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:14:10,997][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:14:11,719][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:14:12,440][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:14:13,162][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:14:13,883][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:14:14,606][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:14:15,326][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:14:16,048][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:14:16,770][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:14:17,493][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:14:18,214][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:14:18,935][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:14:19,658][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:14:20,380][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:14:21,105][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:14:22,057][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:14:22,784][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:14:23,504][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:14:24,225][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:14:24,948][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:14:25,670][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:14:26,393][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:14:27,116][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:14:27,836][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:14:28,558][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:14:29,280][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:14:30,004][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:14:30,726][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:14:31,449][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:14:32,169][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:14:32,892][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:14:33,615][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:14:34,363][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 16:14:35,546][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:14:35,549][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:14:35,551][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:14:36,772][__main__][INFO] - Iteration 114 took 55s (9.38% Gen, 88.43% Train). Generation: 5s, Training: 49s. Estimated remaining time: 13h 42m 45s. Estimated total time: 15h 32m 47s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 16s, 500 more iterations: 7h 46m 23s. [2026-03-25 16:14:36,775][__main__][INFO] - Starting iteration 114. [2026-03-25 16:14:36,780][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:14:36,780][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:14:41,870][__main__][INFO] - Number of regex retries in iteration 114: 0 [2026-03-25 16:14:41,871][__main__][INFO] - agents played in iteration 114 are Bob, Alice [2026-03-25 16:14:42,342][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:14:42,409][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:14:42,410][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:14:42,410][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:14:43,193][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:14:43,842][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:14:44,562][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:14:45,281][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:14:45,998][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:14:46,718][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:14:47,436][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:14:48,156][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:14:48,874][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:14:49,593][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:14:50,315][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:14:51,033][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:14:51,753][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:14:52,470][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:14:53,191][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:14:53,909][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:14:54,629][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:14:55,347][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:14:56,066][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:14:56,786][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:14:57,503][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:14:58,224][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:14:58,946][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:14:59,668][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:15:00,389][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:15:01,108][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:15:01,830][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:15:02,552][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:15:03,272][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:15:03,992][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:15:04,714][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:15:05,436][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:15:06,155][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:15:06,876][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:15:07,598][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:15:08,320][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:15:09,040][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:15:09,762][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:15:10,483][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:15:11,205][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:15:11,927][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:15:12,649][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:15:13,370][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:15:14,091][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:15:14,813][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:15:15,533][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:15:16,256][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:15:16,979][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:15:17,966][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:15:18,690][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:15:19,411][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:15:20,131][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:15:20,853][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:15:21,576][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:15:22,297][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:15:23,019][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:15:23,739][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:15:24,462][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:15:25,184][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:15:25,907][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:15:26,629][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:15:27,351][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:15:28,074][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:15:28,795][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:15:29,516][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:15:30,253][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 16:15:31,286][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:15:31,289][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:15:31,290][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:15:33,113][__main__][INFO] - Iteration 115 took 56s (9.04% Gen, 87.72% Train). Generation: 5s, Training: 49s. Estimated remaining time: 13h 47m 56s. Estimated total time: 15h 38m 55s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 53s, 500 more iterations: 7h 49m 27s. [2026-03-25 16:15:33,115][__main__][INFO] - Starting iteration 115. [2026-03-25 16:15:33,120][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:15:33,120][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:15:38,264][__main__][INFO] - Number of regex retries in iteration 115: 0 [2026-03-25 16:15:38,265][__main__][INFO] - agents played in iteration 115 are Bob, Alice [2026-03-25 16:15:38,734][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:15:38,804][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:15:38,804][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:15:38,805][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:15:39,485][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:15:40,133][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:15:40,856][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:15:41,576][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:15:42,294][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:15:43,013][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:15:43,734][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:15:44,454][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:15:45,173][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:15:45,893][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:15:46,612][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:15:47,333][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:15:48,055][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:15:48,775][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:15:49,495][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:15:50,216][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:15:50,935][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:15:51,654][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:15:52,375][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:15:53,095][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:15:53,813][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:15:54,532][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:15:55,250][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:15:55,970][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:15:56,688][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:15:57,409][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:15:58,128][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:15:58,847][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:15:59,568][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:16:00,286][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:16:01,006][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:16:01,725][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:16:02,444][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:16:03,162][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:16:03,883][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:16:04,604][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:16:05,327][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:16:06,047][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:16:06,770][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:16:07,491][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:16:08,212][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:16:08,934][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:16:09,656][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:16:10,376][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:16:11,097][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:16:11,819][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:16:12,540][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:16:13,263][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:16:14,215][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:16:14,936][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:16:15,657][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:16:16,378][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:16:17,099][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:16:17,820][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:16:18,541][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:16:19,263][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:16:19,983][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:16:20,704][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:16:21,427][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:16:22,150][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:16:22,871][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:16:23,592][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:16:24,314][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:16:25,035][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:16:25,758][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:16:26,494][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 16:16:27,528][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:16:27,531][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:16:27,532][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:16:29,216][__main__][INFO] - Iteration 116 took 56s (9.17% Gen, 87.82% Train). Generation: 5s, Training: 49s. Estimated remaining time: 13h 43m 3s. Estimated total time: 15h 34m 58s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 29s, 500 more iterations: 7h 47m 29s. [2026-03-25 16:16:29,219][__main__][INFO] - Starting iteration 116. [2026-03-25 16:16:29,223][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:16:29,224][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:16:34,403][__main__][INFO] - Number of regex retries in iteration 116: 0 [2026-03-25 16:16:34,403][__main__][INFO] - agents played in iteration 116 are Bob, Alice [2026-03-25 16:16:34,961][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:16:35,030][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:16:35,031][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:16:35,031][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:16:35,734][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:16:36,382][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:16:37,104][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:16:37,822][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:16:38,541][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:16:39,262][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:16:39,980][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:16:40,699][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:16:41,419][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:16:42,137][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:16:42,856][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:16:43,577][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:16:44,295][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:16:45,016][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:16:45,736][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:16:46,455][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:16:47,175][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:16:47,896][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:16:48,616][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:16:49,336][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:16:50,056][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:16:50,777][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:16:51,496][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:16:52,216][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:16:52,938][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:16:53,657][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:16:54,377][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:16:55,099][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:16:55,819][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:16:56,538][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:16:57,261][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:16:57,979][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:16:58,702][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:16:59,424][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:17:00,144][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:17:00,865][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:17:01,586][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:17:02,306][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:17:03,027][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:17:03,749][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:17:04,472][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:17:05,190][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:17:05,911][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:17:06,633][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:17:07,352][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:17:08,074][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:17:08,796][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:17:09,516][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:17:10,474][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:17:11,197][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:17:11,917][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:17:12,639][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:17:13,359][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:17:14,081][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:17:14,803][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:17:15,523][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:17:16,244][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:17:16,967][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:17:17,690][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:17:18,411][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:17:19,132][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:17:19,854][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:17:20,575][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:17:21,295][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:17:22,018][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:17:22,770][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 16:17:24,188][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:17:24,192][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:17:24,195][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:17:25,499][__main__][INFO] - Iteration 117 took 56s (9.20% Gen, 88.48% Train). Generation: 5s, Training: 49s. Estimated remaining time: 13h 45m 6s. Estimated total time: 15h 37m 57s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 47s, 500 more iterations: 7h 48m 58s. [2026-03-25 16:17:25,502][__main__][INFO] - Starting iteration 117. [2026-03-25 16:17:25,507][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:17:25,508][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:17:31,020][__main__][INFO] - Number of regex retries in iteration 117: 0 [2026-03-25 16:17:31,022][__main__][INFO] - agents played in iteration 117 are Bob, Alice [2026-03-25 16:17:31,531][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:17:31,599][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:17:31,600][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:17:31,601][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:17:32,312][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:17:32,960][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:17:33,679][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:17:34,396][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:17:35,114][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:17:35,832][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:17:36,549][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:17:37,266][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:17:37,985][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:17:38,702][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:17:39,421][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:17:40,139][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:17:40,860][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:17:41,579][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:17:42,299][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:17:43,018][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:17:43,736][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:17:44,457][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:17:45,178][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:17:45,898][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:17:46,622][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:17:47,341][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:17:48,061][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:17:48,780][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:17:49,500][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:17:50,219][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:17:50,938][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:17:51,657][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:17:52,376][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:17:53,095][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:17:53,814][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:17:54,533][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:17:55,253][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:17:55,972][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:17:56,691][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:17:57,412][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:17:58,131][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:17:58,850][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:17:59,570][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:18:00,288][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:18:01,008][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:18:01,728][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:18:02,446][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:18:03,167][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:18:03,887][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:18:04,606][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:18:05,326][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:18:06,045][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:18:07,016][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:18:07,738][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:18:08,458][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:18:09,178][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:18:09,899][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:18:10,620][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:18:11,339][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:18:12,059][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:18:12,780][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:18:13,499][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:18:14,220][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:18:14,941][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:18:15,660][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:18:16,381][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:18:17,102][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:18:17,822][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:18:18,543][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:18:19,293][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 16:18:20,403][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:18:20,407][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:18:20,409][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:18:21,678][__main__][INFO] - Iteration 118 took 56s (9.82% Gen, 87.92% Train). Generation: 5s, Training: 49s. Estimated remaining time: 13h 42m 26s. Estimated total time: 15h 36m 13s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 37s, 500 more iterations: 7h 48m 6s. [2026-03-25 16:18:21,682][__main__][INFO] - Starting iteration 118. [2026-03-25 16:18:21,686][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:18:21,687][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:18:26,839][__main__][INFO] - Number of regex retries in iteration 118: 0 [2026-03-25 16:18:26,840][__main__][INFO] - agents played in iteration 118 are Bob, Alice [2026-03-25 16:18:27,311][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:18:27,379][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:18:27,380][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:18:27,381][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:18:28,072][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:18:28,721][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:18:29,439][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:18:30,157][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:18:30,873][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:18:31,592][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:18:32,310][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:18:33,028][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:18:33,747][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:18:34,464][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:18:35,183][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:18:35,902][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:18:36,619][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:18:37,338][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:18:38,056][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:18:38,778][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:18:39,500][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:18:40,219][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:18:40,938][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:18:41,660][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:18:42,381][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:18:43,100][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:18:43,820][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:18:44,541][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:18:45,260][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:18:45,980][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:18:46,703][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:18:47,423][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:18:48,143][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:18:48,864][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:18:49,585][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:18:50,306][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:18:51,026][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:18:51,747][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:18:52,468][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:18:53,189][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:18:53,909][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:18:54,630][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:18:55,353][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:18:56,073][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:18:56,793][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:18:57,515][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:18:58,236][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:18:58,956][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:18:59,678][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:19:00,399][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:19:01,121][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:19:01,842][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:19:02,801][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:19:03,524][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:19:04,245][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:19:04,965][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:19:05,686][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:19:06,408][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:19:07,128][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:19:07,849][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:19:08,571][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:19:09,291][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:19:10,011][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:19:10,731][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:19:11,452][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:19:12,172][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:19:12,894][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:19:13,614][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:19:14,333][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:19:15,063][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 16:19:16,144][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:19:16,147][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:19:16,149][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:19:17,545][__main__][INFO] - Iteration 119 took 55s (9.22% Gen, 88.27% Train). Generation: 5s, Training: 49s. Estimated remaining time: 13h 36m 17s. Estimated total time: 15h 31m 0s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 6s, 500 more iterations: 7h 45m 30s. [2026-03-25 16:19:17,548][__main__][INFO] - Starting iteration 119. [2026-03-25 16:19:17,552][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:19:17,553][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:19:22,602][__main__][INFO] - Number of regex retries in iteration 119: 0 [2026-03-25 16:19:22,603][__main__][INFO] - agents played in iteration 119 are Bob, Alice [2026-03-25 16:19:23,070][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:19:23,140][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:19:23,140][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:19:23,141][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:19:23,820][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:19:24,471][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:19:25,192][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:19:25,908][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:19:26,627][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:19:27,343][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:19:28,063][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:19:28,780][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:19:29,499][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:19:30,216][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:19:30,935][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:19:31,653][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:19:32,371][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:19:33,088][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:19:33,808][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:19:34,527][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:19:35,245][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:19:35,965][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:19:36,684][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:19:37,401][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:19:38,122][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:19:38,840][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:19:39,559][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:19:40,279][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:19:40,998][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:19:41,718][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:19:42,437][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:19:43,156][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:19:43,875][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:19:44,595][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:19:45,314][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:19:46,034][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:19:46,754][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:19:47,473][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:19:48,194][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:19:48,914][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:19:49,633][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:19:50,353][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:19:51,075][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:19:51,795][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:19:52,514][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:19:53,235][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:19:53,955][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:19:54,675][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:19:55,395][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:19:56,117][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:19:56,836][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:19:57,557][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:19:58,503][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:19:59,226][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:19:59,948][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:20:00,668][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:20:01,387][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:20:02,110][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:20:02,829][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:20:03,550][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:20:04,272][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:20:04,991][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:20:05,712][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:20:06,434][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:20:07,154][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:20:07,875][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:20:08,597][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:20:09,318][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:20:10,040][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:20:10,826][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 16:20:12,056][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:20:12,060][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:20:12,063][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:20:13,706][__main__][INFO] - Iteration 120 took 56s (8.99% Gen, 88.08% Train). Generation: 5s, Training: 49s. Estimated remaining time: 13h 40m 16s. Estimated total time: 15h 35m 56s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 35s, 500 more iterations: 7h 47m 58s. [2026-03-25 16:20:13,711][__main__][INFO] - Starting iteration 120. [2026-03-25 16:20:13,717][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:20:13,718][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:20:18,869][__main__][INFO] - Number of regex retries in iteration 120: 0 [2026-03-25 16:20:18,871][__main__][INFO] - agents played in iteration 120 are Bob, Alice [2026-03-25 16:20:19,341][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:20:19,409][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:20:19,410][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:20:19,411][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:20:20,098][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:20:20,744][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:20:21,464][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:20:22,181][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:20:22,899][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:20:23,616][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:20:24,336][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:20:25,053][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:20:25,772][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:20:26,490][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:20:27,209][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:20:27,929][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:20:28,649][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:20:29,370][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:20:30,089][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:20:30,814][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:20:31,534][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:20:32,253][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:20:32,973][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:20:33,691][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:20:34,410][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:20:35,130][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:20:35,848][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:20:36,569][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:20:37,287][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:20:38,008][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:20:38,728][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:20:39,448][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:20:40,167][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:20:40,888][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:20:41,606][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:20:42,327][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:20:43,047][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:20:43,766][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:20:44,487][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:20:45,208][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:20:45,927][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:20:46,647][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:20:47,368][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:20:48,088][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:20:48,808][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:20:49,529][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:20:50,250][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:20:50,970][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:20:51,690][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:20:52,411][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:20:53,131][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:20:53,851][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:20:54,824][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:20:55,546][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:20:56,266][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:20:56,987][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:20:57,706][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:20:58,428][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:20:59,149][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:20:59,868][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:21:00,588][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:21:01,310][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:21:02,031][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:21:02,753][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:21:03,473][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:21:04,194][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:21:04,914][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:21:05,635][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:21:06,355][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:21:07,081][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 16:21:08,342][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:21:08,347][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:21:08,350][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:21:09,650][__main__][INFO] - Iteration 121 took 55s (9.21% Gen, 88.46% Train). Generation: 5s, Training: 49s. Estimated remaining time: 13h 35m 40s. Estimated total time: 15h 32m 16s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 13s, 500 more iterations: 7h 46m 8s. [2026-03-25 16:21:09,654][__main__][INFO] - Starting iteration 121. [2026-03-25 16:21:09,658][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:21:09,659][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:21:14,745][__main__][INFO] - Number of regex retries in iteration 121: 0 [2026-03-25 16:21:14,746][__main__][INFO] - agents played in iteration 121 are Bob, Alice [2026-03-25 16:21:15,218][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:21:15,285][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:21:15,286][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:21:15,287][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:21:15,965][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:21:16,614][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:21:17,333][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:21:18,051][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:21:18,769][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:21:19,487][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:21:20,205][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:21:20,924][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:21:21,641][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:21:22,360][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:21:23,080][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:21:23,797][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:21:24,518][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:21:25,235][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:21:25,955][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:21:26,674][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:21:27,393][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:21:28,113][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:21:28,832][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:21:29,551][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:21:30,270][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:21:30,990][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:21:31,709][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:21:32,428][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:21:33,148][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:21:33,868][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:21:34,586][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:21:35,307][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:21:36,026][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:21:36,746][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:21:37,466][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:21:38,186][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:21:38,907][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:21:39,627][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:21:40,346][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:21:41,067][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:21:41,786][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:21:42,508][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:21:43,226][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:21:43,947][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:21:44,669][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:21:45,388][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:21:46,109][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:21:46,831][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:21:47,550][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:21:48,271][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:21:48,992][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:21:49,712][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:21:50,657][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:21:51,379][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:21:52,099][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:21:52,819][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:21:53,540][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:21:54,261][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:21:54,984][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:21:55,703][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:21:56,424][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:21:57,146][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:21:57,866][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:21:58,588][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:21:59,309][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:22:00,031][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:22:00,752][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:22:01,471][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:22:02,193][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:22:02,921][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 16:22:04,056][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:22:04,061][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:22:04,063][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:22:05,499][__main__][INFO] - Iteration 122 took 55s (9.11% Gen, 88.31% Train). Generation: 5s, Training: 49s. Estimated remaining time: 13h 33m 12s. Estimated total time: 15h 30m 43s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 4s, 500 more iterations: 7h 45m 21s. [2026-03-25 16:22:05,504][__main__][INFO] - Starting iteration 122. [2026-03-25 16:22:05,511][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:22:05,512][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:22:10,729][__main__][INFO] - Number of regex retries in iteration 122: 0 [2026-03-25 16:22:10,731][__main__][INFO] - agents played in iteration 122 are Bob, Alice [2026-03-25 16:22:11,199][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:22:11,267][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:22:11,267][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:22:11,268][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:22:11,948][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:22:12,597][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:22:13,316][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:22:14,036][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:22:14,753][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:22:15,472][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:22:16,191][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:22:16,909][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:22:17,628][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:22:18,349][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:22:19,066][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:22:19,787][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:22:20,505][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:22:21,224][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:22:21,944][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:22:22,663][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:22:23,385][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:22:24,104][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:22:24,823][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:22:25,542][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:22:26,262][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:22:26,981][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:22:27,701][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:22:28,422][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:22:29,140][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:22:29,861][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:22:30,582][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:22:31,300][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:22:32,020][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:22:32,741][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:22:33,460][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:22:34,180][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:22:34,900][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:22:35,620][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:22:36,342][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:22:37,065][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:22:37,787][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:22:38,508][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:22:39,231][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:22:39,954][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:22:40,675][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:22:41,397][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:22:42,119][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:22:42,841][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:22:43,562][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:22:44,283][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:22:45,006][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:22:45,727][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:22:46,684][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:22:47,405][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:22:48,126][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:22:48,849][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:22:49,569][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:22:50,292][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:22:51,015][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:22:51,739][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:22:52,461][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:22:53,181][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:22:53,906][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:22:54,627][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:22:55,350][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:22:56,071][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:22:56,795][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:22:57,518][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:22:58,242][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:22:58,995][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 16:23:00,081][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:23:00,085][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:23:00,086][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:23:01,380][__main__][INFO] - Iteration 123 took 55s (9.34% Gen, 88.33% Train). Generation: 5s, Training: 49s. Estimated remaining time: 13h 32m 46s. Estimated total time: 15h 31m 13s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 7s, 500 more iterations: 7h 45m 36s. [2026-03-25 16:23:01,383][__main__][INFO] - Starting iteration 123. [2026-03-25 16:23:01,388][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:23:01,388][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:23:06,528][__main__][INFO] - Number of regex retries in iteration 123: 0 [2026-03-25 16:23:06,529][__main__][INFO] - agents played in iteration 123 are Bob, Alice [2026-03-25 16:23:06,996][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:23:07,062][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:23:07,063][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:23:07,064][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:23:07,791][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:23:08,441][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:23:09,162][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:23:09,880][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:23:10,597][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:23:11,317][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:23:12,035][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:23:12,754][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:23:13,474][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:23:14,194][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:23:14,914][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:23:15,635][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:23:16,355][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:23:17,077][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:23:17,800][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:23:18,521][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:23:19,241][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:23:19,964][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:23:20,683][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:23:21,405][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:23:22,126][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:23:22,846][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:23:23,570][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:23:24,291][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:23:25,011][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:23:25,732][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:23:26,454][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:23:27,175][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:23:27,895][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:23:28,617][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:23:29,340][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:23:30,060][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:23:30,780][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:23:31,501][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:23:32,222][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:23:32,942][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:23:33,662][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:23:34,383][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:23:35,104][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:23:35,825][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:23:36,544][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:23:37,265][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:23:37,987][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:23:38,707][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:23:39,429][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:23:40,150][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:23:40,870][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:23:41,591][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:23:42,628][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:23:43,350][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:23:44,071][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:23:44,792][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:23:45,513][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:23:46,234][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:23:46,955][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:23:47,676][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:23:48,397][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:23:49,119][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:23:49,839][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:23:50,560][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:23:51,283][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:23:52,002][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:23:52,726][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:23:53,447][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:23:54,169][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:23:54,899][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 16:23:56,292][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:23:56,296][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:23:56,299][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:23:58,212][__main__][INFO] - Iteration 124 took 56s (9.05% Gen, 87.58% Train). Generation: 5s, Training: 49s. Estimated remaining time: 13h 47m 42s. Estimated total time: 15h 47m 6s. Time estimates for 10 more iterations: 9m 28s, 100 more iterations: 1h 34m 42s, 500 more iterations: 7h 53m 33s. [2026-03-25 16:23:58,215][__main__][INFO] - Starting iteration 124. [2026-03-25 16:23:58,219][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:23:58,220][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:24:03,364][__main__][INFO] - Number of regex retries in iteration 124: 0 [2026-03-25 16:24:03,365][__main__][INFO] - agents played in iteration 124 are Bob, Alice [2026-03-25 16:24:03,917][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:24:03,983][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:24:03,984][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:24:03,985][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:24:04,665][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:24:05,313][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:24:06,031][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:24:06,750][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:24:07,467][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:24:08,185][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:24:08,903][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:24:09,621][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:24:10,341][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:24:11,058][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:24:11,776][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:24:12,496][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:24:13,214][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:24:13,932][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:24:14,652][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:24:15,371][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:24:16,091][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:24:16,809][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:24:17,527][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:24:18,247][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:24:18,968][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:24:19,686][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:24:20,407][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:24:21,126][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:24:21,844][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:24:22,565][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:24:23,288][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:24:24,007][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:24:24,729][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:24:25,450][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:24:26,171][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:24:26,890][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:24:27,612][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:24:28,333][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:24:29,053][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:24:29,775][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:24:30,497][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:24:31,218][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:24:31,939][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:24:32,658][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:24:33,380][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:24:34,101][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:24:34,822][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:24:35,544][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:24:36,266][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:24:36,986][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:24:37,707][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:24:38,429][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:24:39,380][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:24:40,103][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:24:40,823][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:24:41,544][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:24:42,266][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:24:42,988][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:24:43,711][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:24:44,431][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:24:45,153][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:24:45,877][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:24:46,597][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:24:47,319][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:24:48,042][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:24:48,764][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:24:49,484][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:24:50,207][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:24:50,928][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:24:51,664][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 16:24:52,821][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:24:52,827][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:24:52,829][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:24:54,925][__main__][INFO] - Iteration 125 took 56s (9.07% Gen, 87.23% Train). Generation: 5s, Training: 49s. Estimated remaining time: 13h 44m 47s. Estimated total time: 15h 45m 8s. Time estimates for 10 more iterations: 9m 27s, 100 more iterations: 1h 34m 30s, 500 more iterations: 7h 52m 34s. [2026-03-25 16:24:54,929][__main__][INFO] - Starting iteration 125. [2026-03-25 16:24:54,935][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:24:54,936][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:25:01,648][__main__][INFO] - Number of regex retries in iteration 125: 0 [2026-03-25 16:25:01,649][__main__][INFO] - agents played in iteration 125 are Bob, Alice [2026-03-25 16:25:02,161][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:25:02,230][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:25:02,231][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:25:02,231][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:25:02,909][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:25:03,558][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:25:04,275][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:25:04,994][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:25:05,710][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:25:06,428][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:25:07,145][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:25:07,864][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:25:08,580][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:25:09,298][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:25:10,015][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:25:10,732][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:25:11,451][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:25:12,168][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:25:12,886][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:25:13,603][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:25:14,321][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:25:15,039][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:25:15,758][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:25:16,477][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:25:17,195][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:25:17,913][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:25:18,632][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:25:19,349][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:25:20,069][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:25:20,787][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:25:21,504][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:25:22,223][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:25:22,941][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:25:23,661][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:25:24,379][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:25:25,098][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:25:25,817][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:25:26,535][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:25:27,254][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:25:27,974][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:25:28,693][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:25:29,413][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:25:30,132][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:25:30,850][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:25:31,572][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:25:32,292][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:25:33,010][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:25:33,731][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:25:34,450][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:25:35,169][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:25:35,888][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:25:36,609][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:25:37,555][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:25:38,277][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:25:38,996][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:25:39,717][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:25:40,435][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:25:41,156][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:25:41,876][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:25:42,596][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:25:43,315][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:25:44,036][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:25:44,756][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:25:45,475][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:25:46,197][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:25:46,917][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:25:47,637][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:25:48,358][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:25:49,078][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:25:49,803][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 16:25:50,978][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:25:50,981][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:25:50,983][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:25:52,427][__main__][INFO] - Iteration 126 took 57s (11.68% Gen, 85.80% Train). Generation: 6s, Training: 49s. Estimated remaining time: 13h 56m 56s. Estimated total time: 15h 58m 14s. Time estimates for 10 more iterations: 9m 34s, 100 more iterations: 1h 35m 49s, 500 more iterations: 7h 59m 7s. [2026-03-25 16:25:52,430][__main__][INFO] - Starting iteration 126. [2026-03-25 16:25:52,434][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:25:52,435][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:26:00,298][__main__][INFO] - Number of regex retries in iteration 126: 0 [2026-03-25 16:26:00,299][__main__][INFO] - agents played in iteration 126 are Bob, Alice [2026-03-25 16:26:00,769][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:26:00,837][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:26:00,838][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:26:00,839][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:26:01,512][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:26:02,158][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:26:02,876][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:26:03,592][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:26:04,308][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:26:05,025][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:26:05,740][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:26:06,458][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:26:07,175][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:26:07,892][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:26:08,609][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:26:09,326][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:26:10,046][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:26:10,763][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:26:11,480][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:26:12,198][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:26:12,915][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:26:13,633][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:26:14,349][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:26:15,068][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:26:15,785][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:26:16,504][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:26:17,221][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:26:17,939][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:26:18,657][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:26:19,374][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:26:20,092][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:26:20,810][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:26:21,530][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:26:22,247][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:26:22,966][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:26:23,685][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:26:24,403][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:26:25,122][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:26:25,841][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:26:26,559][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:26:27,278][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:26:27,998][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:26:28,715][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:26:29,435][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:26:30,152][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:26:30,871][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:26:31,591][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:26:32,309][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:26:33,029][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:26:33,749][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:26:34,468][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:26:35,188][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:26:36,147][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:26:36,868][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:26:37,586][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:26:38,307][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:26:39,027][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:26:39,745][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:26:40,466][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:26:41,184][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:26:41,904][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:26:42,624][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:26:43,342][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:26:44,063][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:26:44,782][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:26:45,502][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:26:46,221][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:26:46,942][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:26:47,661][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:26:48,389][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 16:26:49,449][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:26:49,452][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:26:49,454][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:26:50,717][__main__][INFO] - Iteration 127 took 58s (13.49% Gen, 84.33% Train). Generation: 7s, Training: 49s. Estimated remaining time: 14h 9m 8s. Estimated total time: 16h 11m 24s. Time estimates for 10 more iterations: 9m 42s, 100 more iterations: 1h 37m 8s, 500 more iterations: 8h 5m 42s. [2026-03-25 16:26:50,720][__main__][INFO] - Starting iteration 127. [2026-03-25 16:26:50,724][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:26:50,725][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:26:55,797][__main__][INFO] - Number of regex retries in iteration 127: 0 [2026-03-25 16:26:55,798][__main__][INFO] - agents played in iteration 127 are Bob, Alice [2026-03-25 16:26:56,267][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:26:56,335][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:26:56,336][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:26:56,338][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:26:57,081][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:26:57,729][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:26:58,449][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:26:59,165][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:26:59,883][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:27:00,600][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:27:01,318][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:27:02,035][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:27:02,754][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:27:03,472][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:27:04,189][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:27:04,908][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:27:05,626][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:27:06,344][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:27:07,061][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:27:07,780][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:27:08,497][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:27:09,216][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:27:09,935][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:27:10,653][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:27:11,373][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:27:12,090][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:27:12,810][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:27:13,528][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:27:14,246][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:27:14,966][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:27:15,684][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:27:16,404][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:27:17,123][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:27:17,842][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:27:18,560][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:27:19,280][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:27:19,999][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:27:20,718][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:27:21,438][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:27:22,156][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:27:22,877][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:27:23,597][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:27:24,316][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:27:25,037][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:27:25,756][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:27:26,475][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:27:27,196][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:27:27,915][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:27:28,635][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:27:29,355][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:27:30,076][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:27:30,795][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:27:31,775][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:27:32,496][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:27:33,215][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:27:33,935][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:27:34,656][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:27:35,374][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:27:36,094][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:27:36,815][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:27:37,535][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:27:38,255][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:27:38,978][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:27:39,699][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:27:40,417][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:27:41,137][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:27:41,859][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:27:42,577][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:27:43,298][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:27:44,024][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 16:27:45,418][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:27:45,422][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:27:45,425][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:27:46,739][__main__][INFO] - Iteration 128 took 56s (9.06% Gen, 88.59% Train). Generation: 5s, Training: 49s. Estimated remaining time: 13h 30m 24s. Estimated total time: 15h 33m 37s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 21s, 500 more iterations: 7h 46m 48s. [2026-03-25 16:27:46,743][__main__][INFO] - Starting iteration 128. [2026-03-25 16:27:46,747][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:27:46,748][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:28:00,889][__main__][INFO] - Number of regex retries in iteration 128: 0 [2026-03-25 16:28:00,890][__main__][INFO] - agents played in iteration 128 are Bob, Alice [2026-03-25 16:28:01,560][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:28:01,627][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:28:01,628][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:28:01,629][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:28:02,317][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:28:02,961][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:28:03,679][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:28:04,394][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:28:05,109][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:28:05,825][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:28:06,540][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:28:07,258][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:28:07,973][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:28:08,692][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:28:09,408][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:28:10,126][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:28:10,841][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:28:11,560][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:28:12,275][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:28:12,993][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:28:13,710][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:28:14,426][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:28:15,143][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:28:15,861][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:28:16,578][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:28:17,295][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:28:18,012][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:28:18,730][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:28:19,447][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:28:20,167][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:28:20,883][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:28:21,602][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:28:22,320][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:28:23,039][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:28:23,756][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:28:24,476][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:28:25,194][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:28:25,912][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:28:26,630][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:28:27,348][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:28:28,067][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:28:28,785][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:28:29,504][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:28:30,222][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:28:30,941][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:28:31,659][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:28:32,377][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:28:33,098][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:28:33,816][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:28:34,535][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:28:35,256][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:28:35,975][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:28:36,926][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:28:37,645][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:28:38,363][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:28:39,085][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:28:39,802][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:28:40,522][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:28:41,242][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:28:41,961][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:28:42,680][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:28:43,401][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:28:44,119][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:28:44,837][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:28:45,557][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:28:46,274][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:28:46,994][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:28:47,712][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:28:48,431][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:28:49,159][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 16:28:50,392][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:28:50,396][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:28:50,398][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:28:54,004][__main__][INFO] - Iteration 129 took 1m 7s (21.03% Gen, 73.61% Train). Generation: 14s, Training: 49s. Estimated remaining time: 16h 36m 39s. Estimated total time: 18h 40m 59s. Time estimates for 10 more iterations: 11m 12s, 100 more iterations: 1h 52m 5s, 500 more iterations: 9h 20m 29s. [2026-03-25 16:28:54,013][__main__][INFO] - Starting iteration 129. [2026-03-25 16:28:54,021][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:28:54,022][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:28:59,298][__main__][INFO] - Number of regex retries in iteration 129: 0 [2026-03-25 16:28:59,299][__main__][INFO] - agents played in iteration 129 are Bob, Alice [2026-03-25 16:28:59,771][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:28:59,840][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:28:59,841][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:28:59,842][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:29:00,523][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:29:01,175][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:29:01,895][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:29:02,611][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:29:03,330][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:29:04,046][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:29:04,764][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:29:05,481][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:29:06,198][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:29:06,916][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:29:07,632][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:29:08,351][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:29:09,068][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:29:09,787][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:29:10,504][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:29:11,223][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:29:11,942][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:29:12,659][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:29:13,378][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:29:14,095][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:29:14,813][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:29:15,532][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:29:16,249][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:29:16,969][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:29:17,687][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:29:18,405][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:29:19,124][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:29:19,842][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:29:20,562][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:29:21,280][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:29:22,000][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:29:22,719][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:29:23,438][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:29:24,157][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:29:24,877][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:29:25,596][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:29:26,317][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:29:27,036][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:29:27,757][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:29:28,478][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:29:29,197][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:29:29,917][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:29:30,636][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:29:31,354][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:29:32,072][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:29:32,790][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:29:33,508][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:29:34,227][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:29:35,176][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:29:35,894][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:29:36,611][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:29:37,332][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:29:38,051][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:29:38,772][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:29:39,492][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:29:40,211][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:29:40,929][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:29:41,649][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:29:42,368][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:29:43,088][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:29:43,808][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:29:44,526][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:29:45,247][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:29:45,967][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:29:46,686][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:29:47,429][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 16:29:48,487][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:29:48,490][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:29:48,491][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:29:50,394][__main__][INFO] - Iteration 130 took 56s (9.36% Gen, 87.26% Train). Generation: 5s, Training: 49s. Estimated remaining time: 13h 34m 20s. Estimated total time: 15h 39m 36s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 57s, 500 more iterations: 7h 49m 48s. [2026-03-25 16:29:50,399][__main__][INFO] - Starting iteration 130. [2026-03-25 16:29:50,405][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:29:50,406][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:29:55,490][__main__][INFO] - Number of regex retries in iteration 130: 0 [2026-03-25 16:29:55,492][__main__][INFO] - agents played in iteration 130 are Bob, Alice [2026-03-25 16:29:55,991][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:29:56,059][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:29:56,059][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:29:56,060][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:29:56,800][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:29:57,449][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:29:58,173][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:29:58,892][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:29:59,610][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:30:00,332][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:30:01,048][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:30:01,769][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:30:02,491][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:30:03,211][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:30:03,931][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:30:04,652][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:30:05,372][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:30:06,091][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:30:06,812][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:30:07,533][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:30:08,253][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:30:08,973][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:30:09,695][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:30:10,417][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:30:11,138][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:30:11,859][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:30:12,581][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:30:13,302][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:30:14,022][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:30:14,742][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:30:15,463][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:30:16,185][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:30:16,909][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:30:17,629][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:30:18,348][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:30:19,068][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:30:19,787][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:30:20,508][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:30:21,229][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:30:21,949][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:30:22,668][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:30:23,389][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:30:24,108][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:30:24,829][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:30:25,550][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:30:26,270][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:30:26,991][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:30:27,711][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:30:28,432][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:30:29,151][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:30:29,873][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:30:30,594][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:30:31,583][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:30:32,305][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:30:33,027][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:30:33,747][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:30:34,468][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:30:35,191][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:30:35,913][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:30:36,632][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:30:37,354][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:30:38,074][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:30:38,795][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:30:39,518][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:30:40,238][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:30:40,959][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:30:41,681][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:30:42,401][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:30:43,122][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:30:43,890][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 16:30:44,983][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:30:44,987][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:30:44,989][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:30:46,594][__main__][INFO] - Iteration 131 took 56s (9.05% Gen, 88.08% Train). Generation: 5s, Training: 49s. Estimated remaining time: 13h 30m 19s. Estimated total time: 15h 36m 31s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 39s, 500 more iterations: 7h 48m 15s. [2026-03-25 16:30:46,597][__main__][INFO] - Starting iteration 131. [2026-03-25 16:30:46,601][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:30:46,601][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:30:52,044][__main__][INFO] - Number of regex retries in iteration 131: 0 [2026-03-25 16:30:52,045][__main__][INFO] - agents played in iteration 131 are Bob, Alice [2026-03-25 16:30:52,595][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:30:52,669][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:30:52,670][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:30:52,671][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:30:53,374][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:30:54,023][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:30:54,742][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:30:55,460][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:30:56,175][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:30:56,894][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:30:57,611][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:30:58,328][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:30:59,046][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:30:59,763][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:31:00,482][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:31:01,201][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:31:01,919][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:31:02,638][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:31:03,356][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:31:04,075][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:31:04,794][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:31:05,511][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:31:06,231][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:31:06,949][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:31:07,667][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:31:08,386][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:31:09,104][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:31:09,823][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:31:10,542][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:31:11,260][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:31:11,979][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:31:12,698][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:31:13,417][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:31:14,137][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:31:14,854][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:31:15,575][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:31:16,295][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:31:17,013][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:31:17,733][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:31:18,452][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:31:19,172][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:31:19,892][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:31:20,613][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:31:21,333][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:31:22,054][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:31:22,775][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:31:23,497][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:31:24,217][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:31:24,937][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:31:25,660][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:31:26,382][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:31:27,102][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:31:28,054][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:31:28,775][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:31:29,495][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:31:30,217][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:31:30,938][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:31:31,659][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:31:32,380][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:31:33,102][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:31:33,822][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:31:34,545][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:31:35,266][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:31:35,987][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:31:36,708][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:31:37,430][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:31:38,151][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:31:38,873][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:31:39,595][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:31:40,328][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 16:31:41,661][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:31:41,666][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:31:41,668][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:31:43,097][__main__][INFO] - Iteration 132 took 56s (9.64% Gen, 87.83% Train). Generation: 5s, Training: 49s. Estimated remaining time: 13h 34m 30s. Estimated total time: 15h 41m 38s. Time estimates for 10 more iterations: 9m 24s, 100 more iterations: 1h 34m 9s, 500 more iterations: 7h 50m 49s. [2026-03-25 16:31:43,100][__main__][INFO] - Starting iteration 132. [2026-03-25 16:31:43,104][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:31:43,105][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:31:48,170][__main__][INFO] - Number of regex retries in iteration 132: 0 [2026-03-25 16:31:48,172][__main__][INFO] - agents played in iteration 132 are Bob, Alice [2026-03-25 16:31:48,681][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:31:48,750][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:31:48,750][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:31:48,751][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:31:49,429][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:31:50,079][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:31:50,797][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:31:51,517][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:31:52,235][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:31:52,955][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:31:53,674][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:31:54,391][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:31:55,111][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:31:55,830][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:31:56,549][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:31:57,269][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:31:57,989][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:31:58,708][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:31:59,428][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:32:00,148][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:32:00,867][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:32:01,586][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:32:02,307][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:32:03,027][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:32:03,746][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:32:04,467][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:32:05,187][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:32:05,907][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:32:06,627][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:32:07,348][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:32:08,067][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:32:08,788][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:32:09,510][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:32:10,230][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:32:10,950][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:32:11,672][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:32:12,392][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:32:13,113][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:32:13,834][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:32:14,555][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:32:15,275][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:32:15,996][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:32:16,716][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:32:17,438][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:32:18,157][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:32:18,878][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:32:19,599][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:32:20,321][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:32:21,040][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:32:21,760][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:32:22,481][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:32:23,199][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:32:24,165][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:32:24,887][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:32:25,606][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:32:26,326][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:32:27,047][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:32:27,769][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:32:28,487][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:32:29,209][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:32:29,928][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:32:30,650][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:32:31,369][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:32:32,090][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:32:32,811][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:32:33,530][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:32:34,253][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:32:34,977][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:32:35,698][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:32:36,500][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 16:32:37,836][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:32:37,840][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:32:37,842][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:32:40,459][__main__][INFO] - Iteration 133 took 57s (8.83% Gen, 86.60% Train). Generation: 5s, Training: 49s. Estimated remaining time: 13h 47m 50s. Estimated total time: 15h 55m 56s. Time estimates for 10 more iterations: 9m 33s, 100 more iterations: 1h 35m 35s, 500 more iterations: 7h 57m 58s. [2026-03-25 16:32:40,461][__main__][INFO] - Starting iteration 133. [2026-03-25 16:32:40,466][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:32:40,467][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:32:45,451][__main__][INFO] - Number of regex retries in iteration 133: 0 [2026-03-25 16:32:45,452][__main__][INFO] - agents played in iteration 133 are Bob, Alice [2026-03-25 16:32:45,922][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:32:45,989][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:32:45,990][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:32:45,990][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:32:46,714][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:32:47,361][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:32:48,080][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:32:48,798][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:32:49,515][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:32:50,233][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:32:50,950][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:32:51,669][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:32:52,387][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:32:53,105][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:32:53,821][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:32:54,541][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:32:55,258][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:32:55,977][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:32:56,696][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:32:57,415][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:32:58,132][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:32:58,851][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:32:59,569][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:33:00,287][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:33:01,006][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:33:01,725][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:33:02,443][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:33:03,162][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:33:03,880][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:33:04,599][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:33:05,318][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:33:06,036][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:33:06,756][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:33:07,473][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:33:08,193][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:33:08,914][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:33:09,631][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:33:10,351][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:33:11,071][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:33:11,790][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:33:12,510][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:33:13,231][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:33:13,951][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:33:14,671][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:33:15,393][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:33:16,113][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:33:16,834][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:33:17,555][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:33:18,276][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:33:18,997][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:33:19,718][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:33:20,441][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:33:21,422][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:33:22,144][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:33:22,864][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:33:23,586][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:33:24,307][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:33:25,029][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:33:25,750][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:33:26,471][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:33:27,194][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:33:27,915][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:33:28,636][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:33:29,358][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:33:30,081][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:33:30,803][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:33:31,523][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:33:32,242][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:33:32,966][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:33:33,711][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 16:33:34,982][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:33:34,986][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:33:34,987][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:33:37,644][__main__][INFO] - Iteration 134 took 57s (8.72% Gen, 86.63% Train). Generation: 4s, Training: 49s. Estimated remaining time: 13h 43m 56s. Estimated total time: 15h 53m 0s. Time estimates for 10 more iterations: 9m 31s, 100 more iterations: 1h 35m 18s, 500 more iterations: 7h 56m 30s. [2026-03-25 16:33:37,647][__main__][INFO] - Starting iteration 134. [2026-03-25 16:33:37,652][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:33:37,652][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:33:42,798][__main__][INFO] - Number of regex retries in iteration 134: 0 [2026-03-25 16:33:42,799][__main__][INFO] - agents played in iteration 134 are Bob, Alice [2026-03-25 16:33:43,271][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:33:43,339][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:33:43,340][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:33:43,340][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:33:44,028][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:33:44,677][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:33:45,399][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:33:46,116][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:33:46,834][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:33:47,552][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:33:48,269][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:33:48,987][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:33:49,704][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:33:50,422][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:33:51,139][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:33:51,857][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:33:52,576][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:33:53,292][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:33:54,012][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:33:54,729][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:33:55,447][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:33:56,166][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:33:56,885][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:33:57,602][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:33:58,321][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:33:59,039][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:33:59,757][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:34:00,476][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:34:01,195][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:34:01,913][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:34:02,633][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:34:03,352][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:34:04,070][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:34:04,790][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:34:05,508][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:34:06,228][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:34:06,947][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:34:07,666][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:34:08,386][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:34:09,106][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:34:09,826][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:34:10,546][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:34:11,264][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:34:11,984][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:34:12,705][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:34:13,425][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:34:14,145][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:34:14,867][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:34:15,588][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:34:16,308][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:34:17,029][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:34:17,751][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:34:18,699][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:34:19,422][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:34:20,141][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:34:20,861][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:34:21,582][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:34:22,305][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:34:23,025][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:34:23,746][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:34:24,468][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:34:25,190][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:34:25,912][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:34:26,632][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:34:27,355][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:34:28,077][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:34:28,797][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:34:29,519][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:34:30,241][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:34:30,972][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 16:34:32,061][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:34:32,065][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:34:32,067][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:34:33,416][__main__][INFO] - Iteration 135 took 55s (9.23% Gen, 88.35% Train). Generation: 5s, Training: 49s. Estimated remaining time: 13h 19m 27s. Estimated total time: 15h 29m 26s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 56s, 500 more iterations: 7h 44m 43s. [2026-03-25 16:34:33,419][__main__][INFO] - Starting iteration 135. [2026-03-25 16:34:33,423][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:34:33,424][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:34:38,559][__main__][INFO] - Number of regex retries in iteration 135: 0 [2026-03-25 16:34:38,561][__main__][INFO] - agents played in iteration 135 are Bob, Alice [2026-03-25 16:34:39,030][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:34:39,099][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:34:39,100][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:34:39,101][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:34:39,791][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:34:40,441][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:34:41,161][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:34:41,879][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:34:42,598][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:34:43,316][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:34:44,036][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:34:44,756][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:34:45,474][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:34:46,290][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:34:47,042][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:34:47,761][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:34:48,480][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:34:49,199][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:34:49,918][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:34:50,638][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:34:51,358][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:34:52,077][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:34:52,796][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:34:53,518][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:34:54,238][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:34:54,957][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:34:55,678][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:34:56,396][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:34:57,117][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:34:57,838][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:34:58,558][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:34:59,278][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:34:59,999][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:35:00,718][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:35:01,439][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:35:02,159][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:35:02,881][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:35:03,601][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:35:04,322][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:35:05,043][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:35:05,764][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:35:06,483][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:35:07,202][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:35:07,922][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:35:08,642][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:35:09,362][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:35:10,083][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:35:10,802][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:35:11,523][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:35:12,242][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:35:12,963][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:35:13,684][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:35:14,684][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:35:15,406][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:35:16,127][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:35:16,847][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:35:17,568][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:35:18,289][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:35:19,009][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:35:19,729][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:35:20,449][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:35:21,172][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:35:21,891][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:35:23,327][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:35:24,293][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:35:25,014][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:35:25,735][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:35:26,455][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:35:27,177][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:35:30,198][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:50 [2026-03-25 16:35:31,501][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:35:31,509][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:35:31,511][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:35:33,406][__main__][INFO] - Iteration 136 took 59s (8.56% Gen, 88.27% Train). Generation: 5s, Training: 52s. Estimated remaining time: 14h 28m 45s. Estimated total time: 16h 39m 44s. Time estimates for 10 more iterations: 9m 59s, 100 more iterations: 1h 39m 58s, 500 more iterations: 8h 19m 52s. [2026-03-25 16:35:33,412][__main__][INFO] - Starting iteration 136. [2026-03-25 16:35:33,425][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:35:33,425][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:35:42,416][__main__][INFO] - Number of regex retries in iteration 136: 0 [2026-03-25 16:35:42,417][__main__][INFO] - agents played in iteration 136 are Bob, Alice [2026-03-25 16:35:42,913][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:35:42,982][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:35:42,982][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:35:42,983][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:35:43,681][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:35:44,327][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:35:45,043][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:35:45,759][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:35:46,473][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:35:47,189][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:35:47,905][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:35:48,622][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:35:49,339][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:35:50,055][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:35:50,770][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:35:51,486][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:35:52,201][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:35:52,917][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:35:53,633][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:35:54,349][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:35:55,067][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:35:55,782][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:35:56,500][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:35:57,216][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:35:57,933][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:35:58,650][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:35:59,367][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:36:00,084][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:36:00,801][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:36:01,518][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:36:02,235][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:36:02,953][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:36:03,670][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:36:04,387][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:36:05,104][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:36:05,822][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:36:06,540][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:36:07,258][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:36:07,976][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:36:08,695][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:36:09,413][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:36:10,131][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:36:10,848][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:36:11,568][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:36:12,286][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:36:13,005][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:36:13,722][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:36:14,440][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:36:15,158][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:36:15,878][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:36:16,596][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:36:17,315][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:36:18,287][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:36:19,007][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:36:19,725][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:36:20,444][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:36:21,163][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:36:21,881][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:36:22,602][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:36:23,319][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:36:24,039][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:36:24,758][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:36:25,477][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:36:26,197][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:36:26,915][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:36:27,634][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:36:28,355][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:36:29,073][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:36:29,794][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:36:30,523][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 16:36:31,792][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:36:31,796][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:36:31,798][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:36:33,223][__main__][INFO] - Iteration 137 took 59s (15.04% Gen, 82.58% Train). Generation: 8s, Training: 49s. Estimated remaining time: 14h 24m 41s. Estimated total time: 16h 36m 40s. Time estimates for 10 more iterations: 9m 58s, 100 more iterations: 1h 39m 40s, 500 more iterations: 8h 18m 20s. [2026-03-25 16:36:33,229][__main__][INFO] - Starting iteration 137. [2026-03-25 16:36:33,272][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:36:33,274][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:36:38,491][__main__][INFO] - Number of regex retries in iteration 137: 0 [2026-03-25 16:36:38,492][__main__][INFO] - agents played in iteration 137 are Bob, Alice [2026-03-25 16:36:38,967][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:36:39,037][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:36:39,038][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:36:39,038][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:36:39,732][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:36:40,380][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:36:41,098][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:36:41,815][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:36:42,531][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:36:43,248][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:36:43,964][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:36:44,680][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:36:45,398][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:36:46,115][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:36:46,834][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:36:47,550][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:36:48,268][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:36:48,984][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:36:49,702][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:36:50,421][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:36:51,137][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:36:51,856][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:36:52,574][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:36:53,293][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:36:54,012][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:36:54,729][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:36:55,450][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:36:56,168][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:36:56,887][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:36:57,606][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:36:58,324][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:36:59,043][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:36:59,761][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:37:00,479][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:37:01,199][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:37:01,916][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:37:02,635][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:37:03,355][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:37:04,075][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:37:04,794][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:37:05,513][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:37:06,231][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:37:06,951][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:37:07,670][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:37:08,389][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:37:09,109][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:37:09,828][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:37:10,548][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:37:11,266][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:37:11,986][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:37:12,706][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:37:13,424][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:37:14,365][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:37:15,085][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:37:15,804][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:37:16,525][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:37:17,243][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:37:17,964][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:37:18,682][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:37:19,403][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:37:20,122][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:37:20,842][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:37:21,562][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:37:22,281][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:37:23,001][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:37:23,720][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:37:24,442][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:37:25,161][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:37:25,882][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:37:26,611][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 16:37:29,049][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:37:29,053][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:37:29,055][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:37:30,697][__main__][INFO] - Iteration 138 took 57s (9.09% Gen, 88.04% Train). Generation: 5s, Training: 50s. Estimated remaining time: 13h 44m 12s. Estimated total time: 15h 57m 9s. Time estimates for 10 more iterations: 9m 34s, 100 more iterations: 1h 35m 42s, 500 more iterations: 7h 58m 34s. [2026-03-25 16:37:30,700][__main__][INFO] - Starting iteration 138. [2026-03-25 16:37:30,704][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:37:30,705][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:37:35,900][__main__][INFO] - Number of regex retries in iteration 138: 0 [2026-03-25 16:37:35,901][__main__][INFO] - agents played in iteration 138 are Bob, Alice [2026-03-25 16:37:36,389][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:37:36,457][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:37:36,458][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:37:36,459][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:37:37,141][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:37:37,788][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:37:38,508][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:37:39,225][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:37:39,941][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:37:40,659][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:37:41,375][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:37:42,093][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:37:42,809][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:37:43,526][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:37:44,244][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:37:44,965][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:37:45,685][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:37:46,402][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:37:47,120][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:37:47,842][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:37:48,560][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:37:49,280][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:37:50,001][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:37:50,718][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:37:51,438][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:37:52,158][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:37:52,876][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:37:53,596][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:37:54,316][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:37:55,034][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:37:55,754][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:37:56,474][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:37:57,193][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:37:57,913][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:37:58,634][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:37:59,351][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:38:00,071][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:38:00,790][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:38:01,510][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:38:02,229][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:38:02,948][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:38:03,667][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:38:04,387][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:38:05,105][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:38:05,823][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:38:06,543][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:38:07,262][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:38:07,981][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:38:08,701][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:38:09,419][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:38:10,140][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:38:10,860][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:38:11,873][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:38:12,594][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:38:13,312][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:38:14,037][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:38:14,758][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:38:15,478][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:38:16,200][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:38:16,920][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:38:17,641][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:38:18,363][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:38:19,085][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:38:19,805][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:38:20,527][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:38:21,248][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:38:21,968][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:38:22,689][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:38:23,410][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:38:24,173][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 16:38:25,555][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:38:25,559][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:38:25,561][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:38:27,348][__main__][INFO] - Iteration 139 took 56s (9.17% Gen, 87.67% Train). Generation: 5s, Training: 49s. Estimated remaining time: 13h 30m 12s. Estimated total time: 15h 44m 5s. Time estimates for 10 more iterations: 9m 26s, 100 more iterations: 1h 34m 24s, 500 more iterations: 7h 52m 2s. [2026-03-25 16:38:27,352][__main__][INFO] - Starting iteration 139. [2026-03-25 16:38:27,356][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:38:27,357][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:38:32,558][__main__][INFO] - Number of regex retries in iteration 139: 0 [2026-03-25 16:38:32,559][__main__][INFO] - agents played in iteration 139 are Bob, Alice [2026-03-25 16:38:33,127][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:38:33,195][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:38:33,196][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:38:33,197][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:38:33,901][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:38:34,552][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:38:35,273][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:38:36,076][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:38:36,801][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:38:37,519][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:38:38,237][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:38:38,957][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:38:39,678][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:38:40,396][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:38:41,114][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:38:41,832][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:38:42,550][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:38:43,270][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:38:43,989][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:38:44,707][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:38:45,428][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:38:46,146][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:38:46,866][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:38:47,585][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:38:48,304][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:38:49,023][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:38:49,744][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:38:50,462][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:38:51,182][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:38:51,902][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:38:52,622][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:38:53,343][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:38:54,064][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:38:54,783][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:38:55,503][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:38:56,228][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:38:56,949][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:38:57,668][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:38:58,388][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:38:59,107][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:38:59,825][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:39:00,545][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:39:01,264][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:39:01,984][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:39:02,704][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:39:03,423][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:39:04,142][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:39:04,861][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:39:05,582][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:39:06,300][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:39:07,020][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:39:07,740][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:39:08,722][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:39:09,442][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:39:10,163][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:39:10,881][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:39:11,602][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:39:12,322][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:39:13,041][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:39:13,760][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:39:14,482][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:39:15,201][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:39:15,920][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:39:16,735][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:39:17,455][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:39:18,175][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:39:18,896][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:39:19,616][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:39:20,336][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:39:21,087][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 16:39:22,296][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:39:22,300][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:39:22,302][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:39:23,629][__main__][INFO] - Iteration 140 took 56s (9.24% Gen, 88.39% Train). Generation: 5s, Training: 49s. Estimated remaining time: 13h 23m 5s. Estimated total time: 15h 37m 54s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 47s, 500 more iterations: 7h 48m 57s. [2026-03-25 16:39:23,641][__main__][INFO] - Starting iteration 140. [2026-03-25 16:39:23,653][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:39:23,654][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:39:28,855][__main__][INFO] - Number of regex retries in iteration 140: 0 [2026-03-25 16:39:28,856][__main__][INFO] - agents played in iteration 140 are Bob, Alice [2026-03-25 16:39:29,375][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:39:29,444][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:39:29,445][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:39:29,446][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:39:30,139][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:39:30,788][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:39:31,510][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:39:32,230][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:39:32,949][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:39:33,670][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:39:34,389][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:39:35,109][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:39:35,829][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:39:36,548][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:39:37,269][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:39:37,989][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:39:38,709][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:39:39,431][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:39:40,149][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:39:40,868][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:39:41,588][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:39:42,307][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:39:43,026][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:39:43,744][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:39:44,462][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:39:45,182][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:39:45,901][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:39:46,621][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:39:47,341][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:39:48,060][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:39:48,778][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:39:49,499][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:39:50,218][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:39:50,937][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:39:51,658][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:39:52,378][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:39:53,097][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:39:53,818][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:39:54,537][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:39:55,256][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:39:55,977][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:39:56,697][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:39:57,417][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:39:58,138][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:39:58,858][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:39:59,577][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:40:00,298][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:40:01,018][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:40:01,738][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:40:02,458][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:40:03,180][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:40:03,900][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:40:04,854][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:40:05,576][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:40:06,295][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:40:07,016][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:40:07,737][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:40:08,457][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:40:09,177][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:40:09,899][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:40:10,619][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:40:11,340][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:40:12,061][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:40:12,782][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:40:13,502][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:40:14,223][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:40:14,945][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:40:15,665][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:40:16,390][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:40:17,122][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 16:40:18,188][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:40:18,192][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:40:18,193][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:40:19,565][__main__][INFO] - Iteration 141 took 55s (9.30% Gen, 88.24% Train). Generation: 5s, Training: 49s. Estimated remaining time: 13h 16m 7s. Estimated total time: 15h 31m 53s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 11s, 500 more iterations: 7h 45m 56s. [2026-03-25 16:40:19,568][__main__][INFO] - Starting iteration 141. [2026-03-25 16:40:19,573][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2026-03-25 16:40:19,573][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:40:23,730][mllm.models.large_language_model_local][WARNING] - Response user Last round, the other agent played . 
did not match regex: (|), retry 1/1 [2026-03-25 16:40:28,403][__main__][INFO] - Number of regex retries in iteration 141: 1 [2026-03-25 16:40:28,405][__main__][INFO] - agents played in iteration 141 are Bob, Alice [2026-03-25 16:40:28,884][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:40:28,949][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:40:28,949][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:40:28,950][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:40:29,641][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:40:30,289][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:40:31,008][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:40:31,723][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:40:32,440][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:40:33,157][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:40:33,874][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:40:34,591][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:40:35,308][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:40:36,025][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:40:36,743][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:40:37,461][mllm.training.trainer_common][INFO] - 
Processing mini-batch 44 of 256 [2026-03-25 16:40:38,177][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:40:38,898][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:40:39,615][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:40:40,333][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:40:41,050][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:40:41,770][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:40:42,486][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:40:43,206][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:40:43,924][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:40:44,642][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:40:45,359][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:40:46,079][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:40:46,797][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:40:47,515][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:40:48,234][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:40:48,952][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:40:49,672][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:40:50,389][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:40:51,109][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:40:51,829][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 
16:40:52,547][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:40:53,267][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:40:53,984][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:40:54,705][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:40:55,425][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:40:56,143][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:40:56,864][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:40:57,583][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:40:58,302][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:40:59,022][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:40:59,741][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:41:00,460][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:41:01,181][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:41:01,900][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:41:02,621][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:41:03,342][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:41:04,301][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:41:05,022][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:41:05,741][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:41:06,461][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:41:07,180][mllm.training.trainer_common][INFO] - 
Processing mini-batch 208 of 256 [2026-03-25 16:41:07,901][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:41:08,620][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:41:09,341][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:41:10,061][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:41:10,782][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:41:11,501][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:41:12,222][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:41:12,942][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:41:13,662][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:41:14,384][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:41:15,104][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:41:15,825][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:41:16,550][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 16:41:17,816][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:41:17,822][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:41:17,825][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:41:19,706][__main__][INFO] - Iteration 142 took 1m 0s (14.69% Gen, 82.18% Train). Generation: 8s, Training: 49s. Estimated remaining time: 14h 25m 31s. Estimated total time: 16h 42m 16s. Time estimates for 10 more iterations: 10m 1s, 100 more iterations: 1h 40m 13s, 500 more iterations: 8h 21m 8s. [2026-03-25 16:41:19,711][__main__][INFO] - Starting iteration 142. [2026-03-25 16:41:19,717][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:41:19,718][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:41:25,120][__main__][INFO] - Number of regex retries in iteration 142: 0 [2026-03-25 16:41:25,121][__main__][INFO] - agents played in iteration 142 are Bob, Alice [2026-03-25 16:41:25,605][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:41:25,668][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:41:25,669][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:41:25,670][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:41:26,347][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:41:26,995][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:41:27,713][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:41:28,432][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:41:29,150][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:41:29,868][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:41:30,586][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:41:31,304][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:41:32,021][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:41:32,739][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:41:33,458][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:41:34,176][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:41:34,896][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:41:35,613][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:41:36,332][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:41:37,052][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:41:37,769][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:41:38,490][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:41:39,208][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:41:39,926][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:41:40,646][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:41:41,364][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:41:42,085][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:41:42,804][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:41:43,522][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:41:44,241][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:41:44,960][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:41:45,680][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:41:46,399][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:41:47,120][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:41:47,838][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:41:48,557][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:41:49,277][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:41:49,997][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:41:50,716][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:41:51,436][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:41:52,156][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:41:52,875][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:41:53,596][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:41:54,315][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:41:55,034][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:41:55,756][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:41:56,476][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:41:57,196][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:41:57,917][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:41:58,636][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:41:59,356][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:42:00,076][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:42:01,060][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:42:01,780][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:42:02,501][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:42:03,223][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:42:03,942][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:42:04,662][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:42:05,384][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:42:06,105][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:42:06,824][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:42:07,544][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:42:08,268][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:42:08,988][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:42:09,708][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:42:10,429][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:42:11,150][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:42:11,870][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:42:12,590][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:42:13,310][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 16:42:14,571][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:42:14,575][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:42:14,577][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:42:15,888][__main__][INFO] - Iteration 143 took 56s (9.62% Gen, 88.04% Train). Generation: 5s, Training: 49s. Estimated remaining time: 13h 18m 32s. Estimated total time: 15h 36m 13s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 37s, 500 more iterations: 7h 48m 6s. [2026-03-25 16:42:15,891][__main__][INFO] - Starting iteration 143. [2026-03-25 16:42:15,896][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:42:15,897][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:42:21,079][__main__][INFO] - Number of regex retries in iteration 143: 0 [2026-03-25 16:42:21,081][__main__][INFO] - agents played in iteration 143 are Bob, Alice [2026-03-25 16:42:21,554][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:42:21,617][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:42:21,618][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:42:21,618][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:42:22,337][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:42:22,987][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:42:23,711][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:42:25,043][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:42:26,058][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:42:26,777][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:42:27,497][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:42:28,216][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:42:28,936][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:42:29,654][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:42:30,375][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:42:31,094][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:42:31,813][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:42:32,535][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:42:33,254][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:42:33,973][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:42:34,694][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:42:35,413][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:42:36,131][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:42:36,851][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:42:37,570][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:42:38,288][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:42:39,008][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:42:39,727][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:42:40,447][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:42:41,166][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:42:41,885][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:42:42,606][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:42:43,325][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:42:44,046][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:42:44,766][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:42:45,484][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:42:46,205][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:42:46,926][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:42:47,645][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:42:48,365][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:42:49,085][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:42:49,803][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:42:50,525][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:42:51,245][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:42:51,966][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:42:52,686][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:42:53,406][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:42:54,125][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:42:54,847][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:42:55,567][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:42:56,287][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:42:57,009][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:42:57,950][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:42:58,676][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:42:59,396][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:43:00,117][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:43:00,841][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:43:01,564][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:43:02,286][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:43:03,007][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:43:03,728][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:43:04,450][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:43:05,173][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:43:05,895][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:43:06,615][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:43:07,337][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:43:08,059][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:43:08,780][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:43:09,502][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:43:10,231][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 16:43:11,335][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:43:11,338][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:43:11,340][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:43:12,899][__main__][INFO] - Iteration 144 took 57s (9.09% Gen, 88.17% Train). Generation: 5s, Training: 50s. Estimated remaining time: 13h 31m 26s. Estimated total time: 15h 50m 5s. Time estimates for 10 more iterations: 9m 30s, 100 more iterations: 1h 35m 0s, 500 more iterations: 7h 55m 2s. [2026-03-25 16:43:12,902][__main__][INFO] - Starting iteration 144. [2026-03-25 16:43:12,906][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:43:12,908][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:43:18,191][__main__][INFO] - Number of regex retries in iteration 144: 0 [2026-03-25 16:43:18,193][__main__][INFO] - agents played in iteration 144 are Bob, Alice [2026-03-25 16:43:18,671][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:43:18,735][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:43:18,736][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:43:18,737][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:43:19,424][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:43:20,075][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:43:20,797][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:43:21,517][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:43:22,235][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:43:22,955][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:43:23,677][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:43:24,396][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:43:25,117][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:43:25,838][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:43:26,559][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:43:27,278][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:43:28,000][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:43:28,724][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:43:29,443][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:43:30,161][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:43:30,881][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:43:31,599][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:43:32,318][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:43:33,038][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:43:33,760][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:43:34,482][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:43:35,202][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:43:35,923][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:43:36,642][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:43:37,365][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:43:38,087][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:43:38,807][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:43:39,529][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:43:40,251][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:43:40,972][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:43:41,691][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:43:42,414][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:43:43,136][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:43:43,857][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:43:44,578][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:43:45,299][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:43:46,021][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:43:46,741][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:43:47,460][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:43:48,180][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:43:48,900][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:43:49,620][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:43:50,340][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:43:51,060][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:43:51,779][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:43:52,500][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:43:53,220][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:43:54,164][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:43:54,886][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:43:55,605][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:43:56,326][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:43:57,046][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:43:57,766][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:43:58,486][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:43:59,207][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:43:59,926][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:44:00,647][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:44:01,369][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:44:02,090][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:44:02,810][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:44:03,529][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:44:04,251][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:44:04,972][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:44:05,693][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:44:06,492][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 16:44:07,784][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:44:07,789][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:44:07,792][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:44:09,101][__main__][INFO] - Iteration 145 took 56s (9.41% Gen, 88.26% Train). Generation: 5s, Training: 49s. Estimated remaining time: 13h 17m 1s. Estimated total time: 15h 36m 36s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 39s, 500 more iterations: 7h 48m 18s. [2026-03-25 16:44:09,105][__main__][INFO] - Starting iteration 145. [2026-03-25 16:44:09,110][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:44:09,111][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:44:14,257][__main__][INFO] - Number of regex retries in iteration 145: 0 [2026-03-25 16:44:14,260][__main__][INFO] - agents played in iteration 145 are Bob, Alice [2026-03-25 16:44:14,760][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:44:14,825][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:44:14,825][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:44:14,826][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:44:15,504][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:44:16,154][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:44:16,872][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:44:17,590][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:44:18,307][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:44:19,025][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:44:19,743][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:44:20,461][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:44:21,180][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:44:21,897][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:44:22,618][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:44:23,335][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:44:24,054][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:44:24,772][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:44:25,490][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:44:26,209][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:44:26,928][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:44:27,646][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:44:28,365][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:44:29,084][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:44:29,802][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:44:30,521][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:44:31,240][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:44:31,958][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:44:32,678][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:44:33,397][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:44:34,115][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:44:34,835][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:44:35,554][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:44:36,274][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:44:36,996][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:44:37,713][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:44:38,434][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:44:39,154][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:44:39,874][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:44:40,596][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:44:41,316][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:44:42,037][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:44:42,756][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:44:43,476][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:44:44,196][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:44:44,915][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:44:45,636][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:44:46,357][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:44:47,076][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:44:47,796][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:44:48,516][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:44:49,235][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:44:50,218][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:44:50,941][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:44:51,661][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:44:52,381][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:44:53,101][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:44:53,823][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:44:54,542][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:44:55,262][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:44:55,986][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:44:56,706][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:44:57,427][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:44:58,147][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:44:58,867][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:44:59,586][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:45:00,308][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:45:01,028][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:45:01,748][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:45:02,484][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 16:45:03,596][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:45:03,600][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:45:03,602][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:45:04,944][__main__][INFO] - Iteration 146 took 55s (9.22% Gen, 88.37% Train). Generation: 5s, Training: 49s. Estimated remaining time: 13h 10m 6s. Estimated total time: 15h 30m 37s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 3s, 500 more iterations: 7h 45m 18s. [2026-03-25 16:45:04,947][__main__][INFO] - Starting iteration 146. [2026-03-25 16:45:04,951][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:45:04,952][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:45:10,151][__main__][INFO] - Number of regex retries in iteration 146: 0 [2026-03-25 16:45:10,152][__main__][INFO] - agents played in iteration 146 are Bob, Alice [2026-03-25 16:45:10,736][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:45:10,801][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:45:10,802][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:45:10,802][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:45:11,478][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:45:12,127][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:45:12,849][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:45:13,566][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:45:14,286][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:45:15,004][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:45:15,724][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:45:16,443][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:45:17,161][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:45:17,882][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:45:18,601][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:45:19,320][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:45:20,041][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:45:20,759][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:45:21,479][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:45:22,200][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:45:22,920][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:45:23,639][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:45:24,359][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:45:25,078][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:45:25,800][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:45:26,519][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:45:27,240][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:45:27,959][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:45:28,680][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:45:29,399][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:45:30,121][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:45:30,845][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:45:31,568][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:45:32,289][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:45:33,012][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:45:33,736][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:45:34,458][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:45:35,179][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:45:35,902][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:45:36,624][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:45:37,344][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:45:38,065][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:45:38,787][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:45:39,509][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:45:40,233][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:45:40,954][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:45:41,675][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:45:42,397][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:45:43,117][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:45:43,838][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:45:44,561][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:45:45,284][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:45:46,249][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:45:46,973][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:45:47,694][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:45:48,415][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:45:49,138][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:45:49,862][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:45:50,587][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:45:51,308][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:45:52,029][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:45:52,752][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:45:53,474][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:45:54,195][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:45:54,916][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:45:55,637][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:45:56,358][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:45:57,080][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:45:57,803][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:45:58,552][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 16:45:59,569][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:45:59,573][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:45:59,574][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:46:01,024][__main__][INFO] - Iteration 147 took 56s (9.27% Gen, 88.13% Train). Generation: 5s, Training: 49s. Estimated remaining time: 13h 13m 8s. Estimated total time: 15h 34m 34s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 27s, 500 more iterations: 7h 47m 17s. [2026-03-25 16:46:01,027][__main__][INFO] - Starting iteration 147. [2026-03-25 16:46:01,030][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:46:01,031][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:46:06,212][__main__][INFO] - Number of regex retries in iteration 147: 0 [2026-03-25 16:46:06,214][__main__][INFO] - agents played in iteration 147 are Bob, Alice [2026-03-25 16:46:06,710][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:46:06,774][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:46:06,775][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:46:06,776][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:46:07,455][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:46:08,103][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:46:08,827][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:46:09,545][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:46:10,265][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:46:10,986][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:46:11,705][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:46:12,425][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:46:13,145][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:46:13,863][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:46:14,583][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:46:15,301][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:46:16,022][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:46:16,743][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:46:17,462][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:46:18,182][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:46:18,901][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:46:19,620][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:46:20,339][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:46:21,059][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:46:21,778][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:46:22,497][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:46:23,218][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:46:23,938][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:46:24,657][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:46:25,376][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:46:26,096][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:46:26,815][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:46:27,536][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:46:28,255][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:46:28,974][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:46:29,695][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:46:30,415][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:46:31,136][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:46:31,855][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:46:32,576][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:46:33,296][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:46:34,015][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:46:35,106][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:46:35,827][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:46:36,549][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:46:37,267][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:46:37,989][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:46:38,708][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:46:39,427][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:46:40,145][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:46:40,865][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:46:41,583][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:46:42,540][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:46:43,260][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:46:43,978][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:46:44,698][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:46:45,417][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:46:46,136][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:46:46,857][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:46:47,574][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:46:48,294][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:46:49,014][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:46:49,731][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:46:50,450][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:46:51,170][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:46:51,890][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:46:52,609][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:46:53,328][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:46:54,047][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:46:54,830][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 16:46:56,136][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:46:56,144][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:46:56,150][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:46:57,524][__main__][INFO] - Iteration 148 took 56s (9.17% Gen, 88.39% Train). Generation: 5s, Training: 49s. Estimated remaining time: 13h 19m 12s. Estimated total time: 15h 41m 35s. Time estimates for 10 more iterations: 9m 24s, 100 more iterations: 1h 34m 9s, 500 more iterations: 7h 50m 47s. [2026-03-25 16:46:57,527][__main__][INFO] - Starting iteration 148. [2026-03-25 16:46:57,531][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:46:57,532][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:47:02,608][__main__][INFO] - Number of regex retries in iteration 148: 0 [2026-03-25 16:47:02,609][__main__][INFO] - agents played in iteration 148 are Bob, Alice [2026-03-25 16:47:03,087][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:47:03,151][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:47:03,152][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:47:03,152][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:47:03,867][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:47:04,512][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:47:05,231][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:47:05,946][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:47:06,663][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:47:07,378][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:47:08,097][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:47:08,814][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:47:09,531][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:47:10,247][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:47:10,966][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:47:11,683][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:47:12,400][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:47:13,116][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:47:13,834][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:47:14,550][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:47:15,268][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:47:15,984][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:47:16,703][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:47:17,421][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:47:18,137][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:47:18,853][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:47:19,572][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:47:20,289][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:47:21,007][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:47:21,724][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:47:22,442][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:47:23,161][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:47:23,878][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:47:24,596][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:47:25,314][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:47:26,032][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:47:26,751][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:47:27,468][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:47:28,187][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:47:28,905][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:47:29,624][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:47:30,344][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:47:31,065][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:47:31,785][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:47:32,503][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:47:33,221][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:47:33,941][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:47:34,659][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:47:35,378][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:47:36,098][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:47:36,818][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:47:37,538][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:47:38,494][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:47:39,214][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:47:39,934][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:47:40,652][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:47:41,371][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:47:42,091][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:47:42,811][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:47:43,531][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:47:44,251][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:47:44,969][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:47:45,688][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:47:46,408][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:47:47,129][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:47:47,849][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:47:48,571][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:47:49,289][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:47:50,009][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:47:50,734][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 16:47:51,925][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:47:51,930][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:47:51,933][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:47:53,283][__main__][INFO] - Iteration 149 took 55s (9.11% Gen, 88.47% Train). Generation: 5s, Training: 49s. Estimated remaining time: 13h 5m 55s. Estimated total time: 15h 29m 13s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 55s, 500 more iterations: 7h 44m 36s. [2026-03-25 16:47:53,285][__main__][INFO] - Starting iteration 149. [2026-03-25 16:47:53,289][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:47:53,290][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:47:58,334][__main__][INFO] - Number of regex retries in iteration 149: 0 [2026-03-25 16:47:58,335][__main__][INFO] - agents played in iteration 149 are Bob, Alice [2026-03-25 16:47:58,824][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:47:58,890][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:47:58,890][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:47:58,891][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:47:59,576][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:48:00,221][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:48:00,943][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:48:01,660][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:48:02,379][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:48:03,096][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:48:03,813][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:48:04,530][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:48:05,246][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:48:05,966][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:48:06,684][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:48:07,401][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:48:08,119][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:48:08,837][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:48:09,557][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:48:10,564][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:48:11,283][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:48:12,000][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:48:12,719][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:48:13,436][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:48:16,493][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:48:17,212][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:48:17,929][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:48:18,647][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:48:19,363][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:48:20,082][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:48:20,799][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:48:21,516][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:48:22,234][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:48:22,952][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:48:28,371][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:48:29,086][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:48:29,804][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:48:30,522][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:48:31,238][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:48:31,957][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:48:32,674][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:48:33,392][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:48:34,108][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:48:34,824][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:48:35,542][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:48:36,257][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:48:36,976][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:48:37,692][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:48:38,410][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:48:39,127][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:48:39,847][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:48:40,563][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:48:41,510][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:48:42,228][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:48:42,944][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:48:43,662][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:48:44,381][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:48:45,098][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:48:45,816][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:48:46,532][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:48:47,252][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:48:47,974][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:48:48,695][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:48:49,413][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:48:50,133][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:48:50,853][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:48:51,571][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:48:52,290][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:48:53,011][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:48:53,738][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:54 [2026-03-25 16:48:54,945][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:48:54,949][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:48:54,951][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:48:56,390][__main__][INFO] - Iteration 150 took 1m 3s (7.99% Gen, 89.72% Train). Generation: 5s, Training: 56s. Estimated remaining time: 15h 7m 20s. Estimated total time: 17h 31m 42s. Time estimates for 10 more iterations: 10m 31s, 100 more iterations: 1h 45m 10s, 500 more iterations: 8h 45m 51s. [2026-03-25 16:48:56,393][__main__][INFO] - Starting iteration 150. [2026-03-25 16:48:56,396][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2026-03-25 16:48:56,397][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:49:01,569][__main__][INFO] - Number of regex retries in iteration 150: 0 [2026-03-25 16:49:01,570][__main__][INFO] - agents played in iteration 150 are Bob, Alice [2026-03-25 16:49:02,071][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:49:02,137][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:49:02,138][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:49:02,138][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:49:02,820][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:49:03,465][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:49:04,185][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:49:04,898][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:49:05,612][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:49:06,331][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:49:07,047][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:49:07,763][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:49:08,479][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:49:09,196][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:49:09,914][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:49:10,630][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:49:11,346][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:49:12,063][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:49:12,781][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:49:13,496][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:49:14,214][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:49:14,929][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:49:15,648][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:49:16,364][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:49:17,082][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:49:17,800][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:49:18,516][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:49:19,234][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:49:19,950][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:49:20,669][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:49:21,386][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:49:22,105][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:49:22,823][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:49:23,539][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:49:24,257][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:49:24,975][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:49:25,693][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:49:26,412][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:49:27,131][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:49:27,848][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:49:28,567][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:49:29,284][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:49:30,001][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:49:30,720][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:49:31,437][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:49:32,155][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:49:32,873][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:49:33,592][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:49:34,310][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:49:35,029][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:49:35,747][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:49:36,464][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:49:37,475][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:49:38,196][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:49:38,915][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:49:39,635][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:49:40,353][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:49:41,073][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:49:41,791][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:49:42,510][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:49:43,230][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:49:43,948][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:49:44,667][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:49:45,385][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:49:46,104][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:49:46,823][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:49:47,541][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:49:48,262][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:49:48,982][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:49:49,748][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 16:49:50,816][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:49:50,819][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:49:50,821][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:49:53,707][__main__][INFO] - Iteration 151 took 57s (9.03% Gen, 85.93% Train). Generation: 5s, Training: 49s. Estimated remaining time: 13h 29m 53s. Estimated total time: 15h 55m 12s. Time estimates for 10 more iterations: 9m 33s, 100 more iterations: 1h 35m 31s, 500 more iterations: 7h 57m 36s. [2026-03-25 16:49:53,711][__main__][INFO] - Starting iteration 151. [2026-03-25 16:49:53,716][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:49:53,717][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:49:58,810][__main__][INFO] - Number of regex retries in iteration 151: 0 [2026-03-25 16:49:58,811][__main__][INFO] - agents played in iteration 151 are Bob, Alice [2026-03-25 16:49:59,297][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:49:59,363][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:49:59,364][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:49:59,365][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:50:00,056][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:50:00,702][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:50:01,419][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:50:02,137][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:50:02,850][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:50:03,570][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:50:04,286][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:50:05,003][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:50:05,719][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:50:06,437][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:50:07,156][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:50:07,873][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:50:08,595][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:50:09,313][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:50:10,032][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:50:10,752][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:50:11,470][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:50:12,189][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:50:12,908][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:50:13,625][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:50:14,343][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:50:15,062][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:50:15,780][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:50:16,499][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:50:17,217][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:50:17,936][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:50:18,655][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:50:19,376][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:50:20,097][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:50:20,817][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:50:21,535][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:50:22,253][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:50:22,973][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:50:23,694][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:50:24,412][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:50:25,132][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:50:25,853][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:50:26,572][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:50:27,291][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:50:28,012][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:50:28,732][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:50:29,452][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:50:30,173][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:50:30,893][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:50:31,614][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:50:32,334][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:50:33,055][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:50:33,773][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:50:34,734][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:50:35,457][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:50:36,178][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:50:36,899][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:50:37,622][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:50:38,345][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:50:39,065][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:50:39,786][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:50:40,508][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:50:41,229][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:50:41,950][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:50:42,673][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:50:43,396][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:50:44,118][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:50:44,838][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:50:45,561][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:50:46,282][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:50:47,050][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 16:50:48,017][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:50:48,020][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:50:48,021][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:50:49,433][__main__][INFO] - Iteration 152 took 55s (9.14% Gen, 88.32% Train). Generation: 5s, Training: 49s. Estimated remaining time: 13h 2m 23s. Estimated total time: 15h 28m 38s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 51s, 500 more iterations: 7h 44m 19s. [2026-03-25 16:50:49,435][__main__][INFO] - Starting iteration 152. [2026-03-25 16:50:49,440][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:50:49,441][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:50:54,703][__main__][INFO] - Number of regex retries in iteration 152: 0 [2026-03-25 16:50:54,704][__main__][INFO] - agents played in iteration 152 are Bob, Alice [2026-03-25 16:50:55,226][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:50:55,295][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:50:55,295][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:50:55,296][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:50:56,032][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:50:56,679][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:50:57,399][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:50:58,117][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:50:58,836][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:50:59,553][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:51:00,276][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:51:00,993][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:51:01,713][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:51:02,431][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:51:03,150][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:51:03,867][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:51:04,589][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:51:05,308][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:51:06,028][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:51:06,748][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:51:07,471][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:51:08,190][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:51:08,909][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:51:09,632][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:51:10,350][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:51:11,071][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:51:11,794][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:51:12,515][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:51:13,235][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:51:13,955][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:51:14,673][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:51:15,392][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:51:16,111][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:51:16,832][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:51:17,551][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:51:18,269][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:51:18,988][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:51:19,705][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:51:20,423][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:51:21,143][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:51:21,861][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:51:22,581][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:51:23,300][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:51:24,018][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:51:24,738][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:51:25,456][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:51:26,176][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:51:26,895][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:51:27,615][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:51:28,334][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:51:29,054][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:51:29,775][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:51:30,726][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:51:31,446][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:51:32,165][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:51:32,885][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:51:33,605][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:51:34,324][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:51:35,043][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:51:35,763][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:51:36,482][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:51:37,202][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:51:37,922][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:51:38,643][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:51:39,364][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:51:40,084][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:51:40,804][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:51:41,524][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:51:42,245][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:51:42,968][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 16:51:43,964][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:51:43,967][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:51:43,969][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:51:45,301][__main__][INFO] - Iteration 153 took 55s (9.42% Gen, 88.19% Train). Generation: 5s, Training: 49s. Estimated remaining time: 13h 3m 53s. Estimated total time: 15h 31m 4s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 6s, 500 more iterations: 7h 45m 32s. [2026-03-25 16:51:45,304][__main__][INFO] - Starting iteration 153. [2026-03-25 16:51:45,309][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:51:45,310][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:51:50,513][__main__][INFO] - Number of regex retries in iteration 153: 0 [2026-03-25 16:51:50,515][__main__][INFO] - agents played in iteration 153 are Bob, Alice [2026-03-25 16:51:51,082][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:51:51,146][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:51:51,147][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:51:51,147][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:51:51,831][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:51:52,478][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:51:53,197][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:51:53,915][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:51:54,633][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:51:55,351][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:51:56,068][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:51:56,787][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:51:57,504][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:51:58,223][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:51:58,941][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:51:59,659][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:52:00,377][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:52:01,094][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:52:01,814][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:52:02,530][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:52:03,252][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:52:03,971][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:52:04,690][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:52:05,410][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:52:06,129][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:52:06,849][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:52:07,567][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:52:08,287][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:52:09,006][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:52:09,725][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:52:10,446][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:52:11,164][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:52:11,887][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:52:12,609][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:52:13,329][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:52:14,050][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:52:14,771][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:52:15,493][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:52:16,214][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:52:16,935][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:52:17,655][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:52:18,377][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:52:19,097][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:52:19,818][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:52:20,540][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:52:21,261][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:52:21,983][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:52:22,703][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:52:23,425][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:52:24,147][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:52:24,870][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:52:25,593][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:52:26,652][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:52:27,374][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:52:28,096][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:52:28,817][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:52:29,540][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:52:30,262][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:52:30,984][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:52:31,707][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:52:32,429][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:52:33,151][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:52:33,872][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:52:34,595][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:52:35,318][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:52:36,040][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:52:36,762][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:52:37,483][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:52:38,203][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:52:38,945][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 16:52:39,897][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:52:39,901][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:52:39,902][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:52:41,228][__main__][INFO] - Iteration 154 took 55s (9.31% Gen, 88.31% Train). Generation: 5s, Training: 49s. Estimated remaining time: 13h 3m 55s. Estimated total time: 15h 32m 1s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 12s, 500 more iterations: 7h 46m 0s. [2026-03-25 16:52:41,231][__main__][INFO] - Starting iteration 154. [2026-03-25 16:52:41,236][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:52:41,237][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:52:46,482][__main__][INFO] - Number of regex retries in iteration 154: 0 [2026-03-25 16:52:46,484][__main__][INFO] - agents played in iteration 154 are Bob, Alice [2026-03-25 16:52:47,018][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:52:48,467][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:52:52,412][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:52:52,414][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:52:53,103][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:52:53,750][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:52:54,467][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:52:55,183][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:52:55,901][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:52:56,617][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:52:57,334][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:52:58,049][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:52:58,767][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:52:59,482][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:53:00,201][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:53:00,917][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:53:01,635][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:53:02,352][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:53:03,068][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:53:03,786][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:53:04,502][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:53:05,220][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:53:05,937][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:53:06,654][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:53:07,378][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:53:08,097][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:53:08,817][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:53:09,537][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:53:10,256][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:53:10,976][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:53:11,697][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:53:12,416][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:53:13,134][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:53:13,855][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:53:14,574][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:53:15,292][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:53:16,015][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:53:16,735][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:53:17,454][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:53:18,175][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:53:18,895][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:53:19,614][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:53:20,336][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:53:21,055][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:53:21,776][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:53:22,495][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:53:23,217][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:53:23,938][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:53:24,659][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:53:25,380][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:53:26,101][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:53:26,822][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:53:27,775][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:53:28,497][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:53:29,218][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:53:29,941][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:53:30,662][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:53:31,382][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:53:32,105][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:53:32,825][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:53:33,547][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:53:34,269][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:53:34,991][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:53:35,714][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:53:36,436][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:53:37,157][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:53:37,879][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:53:38,601][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:53:39,322][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:53:40,057][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 16:53:41,069][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:53:41,072][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:53:41,074][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:53:42,568][__main__][INFO] - Iteration 155 took 1m 1s (8.56% Gen, 89.00% Train). Generation: 5s, Training: 54s. Estimated remaining time: 14h 33m 5s. Estimated total time: 17h 2m 13s. Time estimates for 10 more iterations: 10m 13s, 100 more iterations: 1h 42m 13s, 500 more iterations: 8h 31m 6s. [2026-03-25 16:53:42,571][__main__][INFO] - Starting iteration 155. [2026-03-25 16:53:42,575][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:53:42,575][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:53:48,843][__main__][INFO] - Number of regex retries in iteration 155: 0 [2026-03-25 16:53:48,844][__main__][INFO] - agents played in iteration 155 are Bob, Alice [2026-03-25 16:53:49,324][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:53:49,387][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:53:49,388][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:53:49,389][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:53:50,094][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:53:50,740][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:53:51,460][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:53:52,178][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:53:52,896][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:53:53,613][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:53:54,331][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:53:55,048][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:53:55,767][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:53:56,485][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:53:57,203][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:53:57,922][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:53:58,639][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:53:59,358][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:54:00,077][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:54:00,796][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:54:01,515][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:54:02,233][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:54:02,953][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:54:03,670][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:54:04,390][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:54:05,110][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:54:05,830][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:54:06,549][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:54:07,269][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:54:07,988][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:54:08,707][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:54:09,426][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:54:10,148][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:54:10,868][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:54:11,588][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:54:12,308][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:54:13,029][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:54:13,749][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:54:14,469][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:54:15,189][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:54:15,907][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:54:16,626][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:54:17,347][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:54:18,065][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:54:18,784][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:54:19,505][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:54:20,225][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:54:20,944][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:54:21,666][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:54:22,384][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:54:27,339][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:54:28,794][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:54:29,766][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:54:30,488][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:54:31,207][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:54:31,927][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:54:32,647][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:54:33,367][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:54:34,087][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:54:34,808][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:54:35,530][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:54:36,250][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:54:36,970][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:54:37,692][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:54:38,413][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:54:39,132][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:54:39,854][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:54:40,576][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:54:41,297][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:54:42,080][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:51 [2026-03-25 16:54:43,088][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:54:43,094][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:54:43,095][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:54:44,465][__main__][INFO] - Iteration 156 took 1m 1s (10.13% Gen, 87.65% Train). Generation: 6s, Training: 54s. Estimated remaining time: 14h 41m 21s. Estimated total time: 17h 11m 32s. Time estimates for 10 more iterations: 10m 18s, 100 more iterations: 1h 43m 9s, 500 more iterations: 8h 35m 46s. [2026-03-25 16:54:44,468][__main__][INFO] - Starting iteration 156. [2026-03-25 16:54:44,472][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:54:44,473][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:54:49,584][__main__][INFO] - Number of regex retries in iteration 156: 0 [2026-03-25 16:54:49,585][__main__][INFO] - agents played in iteration 156 are Bob, Alice [2026-03-25 16:54:50,080][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:54:50,144][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:54:50,145][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:54:50,146][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:54:50,825][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:54:51,472][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:54:52,194][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:54:52,913][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:54:53,629][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:54:54,348][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:54:55,067][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:54:55,785][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:54:56,505][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:54:57,222][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:54:57,943][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:54:58,662][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:54:59,381][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:55:00,101][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:55:00,820][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:55:01,539][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:55:02,260][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:55:02,978][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:55:03,697][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:55:04,417][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:55:05,137][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:55:05,858][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:55:06,579][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:55:07,298][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:55:08,018][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:55:08,741][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:55:09,460][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:55:10,181][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:55:10,903][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:55:11,622][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:55:12,340][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:55:13,061][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:55:13,780][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:55:14,501][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:55:15,220][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:55:15,939][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:55:16,660][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:55:17,379][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:55:18,101][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:55:18,820][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:55:19,540][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:55:20,260][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:55:20,979][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:55:21,700][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:55:22,419][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:55:23,139][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:55:23,860][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:55:24,582][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:55:25,570][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:55:26,293][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:55:27,010][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:55:27,732][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:55:28,454][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:55:29,174][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:55:29,894][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:55:30,617][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:55:31,339][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:55:32,058][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:55:32,778][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:55:33,499][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:55:34,219][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:55:34,939][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:55:35,661][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:55:36,381][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:55:37,102][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:55:37,828][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 16:55:38,807][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:55:38,810][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:55:38,812][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:55:40,224][__main__][INFO] - Iteration 157 took 55s (9.17% Gen, 88.29% Train). Generation: 5s, Training: 49s. Estimated remaining time: 12h 58m 7s. Estimated total time: 15h 29m 13s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 55s, 500 more iterations: 7h 44m 36s. [2026-03-25 16:55:40,226][__main__][INFO] - Starting iteration 157. [2026-03-25 16:55:40,232][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:55:40,234][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:55:45,338][__main__][INFO] - Number of regex retries in iteration 157: 0 [2026-03-25 16:55:45,339][__main__][INFO] - agents played in iteration 157 are Bob, Alice [2026-03-25 16:55:45,820][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:55:45,884][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:55:45,885][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:55:45,886][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:55:46,561][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:55:47,211][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:55:47,934][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:55:48,653][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:55:49,376][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:55:50,093][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:55:50,813][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:55:51,531][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:55:52,249][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:55:52,969][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:55:53,687][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:55:54,406][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:55:55,124][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:55:55,844][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:55:56,564][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:55:57,284][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:55:58,002][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:55:58,722][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:55:59,442][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:56:00,161][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:56:00,881][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:56:01,601][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:56:02,320][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:56:03,039][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:56:03,758][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:56:04,478][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:56:05,198][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:56:05,919][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:56:06,638][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:56:07,358][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:56:08,079][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:56:08,799][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:56:09,520][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:56:10,241][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:56:10,960][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:56:11,681][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:56:12,402][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:56:13,121][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:56:13,844][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:56:14,564][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:56:15,285][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:56:16,006][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:56:16,727][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:56:17,448][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:56:18,168][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:56:18,890][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:56:19,613][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:56:20,332][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:56:21,280][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:56:22,002][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:56:22,724][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:56:23,445][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:56:24,167][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:56:24,889][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:56:25,610][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:56:26,332][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:56:27,054][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:56:27,775][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:56:28,495][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:56:29,218][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:56:29,942][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:56:30,663][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:56:31,384][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:56:32,106][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:56:32,828][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:56:33,563][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 16:56:34,660][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:56:34,665][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:56:34,667][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:56:36,093][__main__][INFO] - Iteration 158 took 55s (9.14% Gen, 88.30% Train). Generation: 5s, Training: 49s. Estimated remaining time: 12h 59m 2s. Estimated total time: 15h 31m 4s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 6s, 500 more iterations: 7h 45m 32s. [2026-03-25 16:56:36,095][__main__][INFO] - Starting iteration 158. [2026-03-25 16:56:36,100][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:56:36,100][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:56:41,158][__main__][INFO] - Number of regex retries in iteration 158: 0 [2026-03-25 16:56:41,160][__main__][INFO] - agents played in iteration 158 are Bob, Alice [2026-03-25 16:56:41,642][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:56:41,706][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:56:41,707][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:56:41,707][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:56:42,410][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:56:43,061][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:56:43,784][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:56:44,502][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:56:45,221][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:56:45,940][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:56:46,658][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:56:47,379][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:56:48,099][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:56:48,819][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:56:49,538][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:56:50,260][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:56:50,979][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:56:51,699][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:56:52,421][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:56:53,140][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:56:53,861][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:56:54,582][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:56:55,302][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:56:56,021][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:56:56,743][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:56:57,466][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:56:58,186][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:56:58,908][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:56:59,627][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:57:00,349][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:57:01,071][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:57:01,790][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:57:02,511][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:57:03,234][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:57:03,955][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:57:04,676][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:57:05,397][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:57:06,119][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:57:06,841][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:57:07,561][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:57:08,281][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:57:09,004][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:57:09,726][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:57:10,448][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:57:11,169][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:57:11,889][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:57:12,610][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:57:13,334][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:57:14,056][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:57:14,777][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:57:15,498][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:57:16,220][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:57:17,168][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:57:17,892][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:57:18,614][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:57:19,337][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:57:20,059][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:57:20,782][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:57:21,503][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:57:22,226][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:57:22,948][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:57:23,673][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:57:24,394][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:57:25,117][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:57:25,840][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:57:26,565][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:57:27,288][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:57:28,011][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:57:28,894][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:57:29,671][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 16:57:30,933][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:57:30,938][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:57:30,940][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:57:32,275][__main__][INFO] - Iteration 159 took 56s (9.01% Gen, 88.61% Train). Generation: 5s, Training: 49s. Estimated remaining time: 13h 3m 19s. Estimated total time: 15h 36m 17s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 37s, 500 more iterations: 7h 48m 8s. [2026-03-25 16:57:32,278][__main__][INFO] - Starting iteration 159. [2026-03-25 16:57:32,283][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:57:32,284][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:57:39,169][__main__][INFO] - Number of regex retries in iteration 159: 0 [2026-03-25 16:57:39,170][__main__][INFO] - agents played in iteration 159 are Bob, Alice [2026-03-25 16:57:39,668][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:57:39,733][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:57:39,734][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:57:39,735][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:57:40,423][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:57:41,074][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:57:41,796][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:57:42,517][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:57:43,238][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:57:43,960][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:57:44,679][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:57:45,399][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:57:46,120][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:57:46,839][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:57:47,560][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:57:48,281][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:57:49,003][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:57:49,723][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:57:50,444][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:57:51,165][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:57:51,887][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:57:52,609][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:57:53,330][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:57:54,051][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:57:54,773][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:57:55,494][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:57:56,215][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:57:56,937][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:57:57,658][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:57:58,379][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:57:59,101][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:57:59,823][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:58:00,544][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:58:01,267][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:58:01,988][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:58:02,710][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:58:03,431][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:58:04,153][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:58:04,877][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:58:05,597][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:58:06,321][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:58:07,042][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:58:07,765][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:58:08,487][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:58:09,210][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:58:09,930][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:58:10,654][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:58:11,378][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:58:12,101][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:58:12,823][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:58:13,545][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:58:14,267][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:58:15,302][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:58:16,025][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:58:16,746][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:58:17,471][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:58:18,194][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:58:18,916][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:58:19,639][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:58:20,362][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:58:21,084][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:58:21,809][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:58:22,530][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:58:23,253][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:58:23,976][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:58:24,700][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:58:25,424][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:58:26,146][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:58:26,867][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:58:27,600][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 16:58:28,766][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:58:28,769][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:58:28,771][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:58:30,855][__main__][INFO] - Iteration 160 took 58s (11.76% Gen, 84.68% Train). Generation: 6s, Training: 49s. Estimated remaining time: 13h 42m 18s. Estimated total time: 16h 16m 14s. Time estimates for 10 more iterations: 9m 45s, 100 more iterations: 1h 37m 37s, 500 more iterations: 8h 8m 7s. [2026-03-25 16:58:30,859][__main__][INFO] - Starting iteration 160. [2026-03-25 16:58:30,866][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:58:30,868][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:58:39,264][__main__][INFO] - Number of regex retries in iteration 160: 0 [2026-03-25 16:58:39,265][__main__][INFO] - agents played in iteration 160 are Bob, Alice [2026-03-25 16:58:39,829][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:58:39,892][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:58:39,893][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:58:39,894][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:58:40,578][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:58:41,226][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:58:41,949][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:58:42,668][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:58:43,387][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:58:44,108][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:58:44,828][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:58:45,547][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:58:46,269][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:58:46,987][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:58:47,705][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:58:48,425][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:58:49,144][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:58:49,863][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:58:50,582][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:58:51,300][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:58:52,020][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:58:52,741][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:58:53,460][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:58:54,179][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:58:54,901][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:58:55,620][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:58:56,340][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:58:57,061][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:58:57,781][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:58:58,500][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:58:59,222][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:58:59,941][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:59:00,661][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:59:01,383][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:59:02,103][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:59:02,823][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:59:03,544][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:59:04,265][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:59:04,985][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:59:05,706][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:59:06,427][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:59:07,148][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:59:07,869][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:59:08,591][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:59:09,314][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:59:10,037][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:59:10,757][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:59:11,477][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:59:12,198][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:59:12,919][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:59:13,641][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:59:14,363][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:59:15,309][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:59:16,032][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:59:16,754][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:59:17,475][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
16:59:18,196][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:59:18,920][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:59:19,642][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:59:20,364][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:59:21,085][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:59:21,807][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:59:22,529][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:59:23,253][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:59:23,974][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:59:24,697][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:59:25,419][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:59:26,142][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:59:26,865][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 16:59:27,598][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 16:59:28,576][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:59:28,579][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:59:28,581][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:59:30,328][__main__][INFO] - Iteration 161 took 59s (14.12% Gen, 82.93% Train). Generation: 8s, Training: 49s. Estimated remaining time: 13h 56m 8s. Estimated total time: 16h 31m 4s. Time estimates for 10 more iterations: 9m 54s, 100 more iterations: 1h 39m 6s, 500 more iterations: 8h 15m 32s. [2026-03-25 16:59:30,332][__main__][INFO] - Starting iteration 161. [2026-03-25 16:59:30,339][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 16:59:30,341][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:59:35,500][__main__][INFO] - Number of regex retries in iteration 161: 0 [2026-03-25 16:59:35,502][__main__][INFO] - agents played in iteration 161 are Bob, Alice [2026-03-25 16:59:36,025][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:59:36,090][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 16:59:36,092][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 16:59:36,093][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 16:59:36,779][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:59:37,430][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:59:38,152][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:59:38,872][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:59:39,592][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:59:40,312][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:59:41,030][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:59:41,750][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:59:42,471][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:59:43,190][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:59:43,912][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 16:59:44,632][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:59:45,353][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:59:46,072][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:59:46,794][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:59:47,516][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:59:48,238][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:59:48,958][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:59:49,680][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:59:50,401][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:59:51,121][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:59:51,842][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:59:52,563][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:59:53,285][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:59:54,005][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:59:54,727][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:59:55,449][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:59:56,170][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:59:56,892][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:59:57,614][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:59:58,336][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:59:59,058][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 16:59:59,780][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:00:00,501][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:00:01,223][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:00:01,944][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:00:02,667][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:00:03,388][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:00:04,109][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:00:04,830][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:00:05,554][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:00:06,277][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:00:07,000][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:00:07,721][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:00:08,444][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:00:09,168][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:00:09,890][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:00:10,614][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:00:11,558][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:00:12,282][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:00:13,003][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:00:13,725][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:00:14,449][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:00:15,172][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:00:15,893][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:00:16,617][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:00:17,338][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:00:18,062][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:00:18,787][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:00:19,510][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:00:20,234][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:00:20,956][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:00:21,678][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:00:22,403][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:00:23,126][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:00:23,845][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 17:00:25,388][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:00:25,393][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:00:25,396][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:00:27,048][__main__][INFO] - Iteration 162 took 56s (9.10% Gen, 87.98% Train). Generation: 5s, Training: 49s. Estimated remaining time: 13h 9m 19s. Estimated total time: 15h 45m 12s. Time estimates for 10 more iterations: 9m 27s, 100 more iterations: 1h 34m 31s, 500 more iterations: 7h 52m 36s. [2026-03-25 17:00:27,051][__main__][INFO] - Starting iteration 162. [2026-03-25 17:00:27,055][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:00:27,056][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:00:32,209][__main__][INFO] - Number of regex retries in iteration 162: 0 [2026-03-25 17:00:32,210][__main__][INFO] - agents played in iteration 162 are Bob, Alice [2026-03-25 17:00:32,776][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:00:32,843][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:00:32,844][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:00:32,844][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:00:33,539][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:00:34,193][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:00:34,914][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:00:35,634][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:00:36,352][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:00:37,074][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:00:37,795][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:00:38,514][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:00:39,237][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:00:39,958][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:00:40,677][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:00:41,399][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:00:42,121][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:00:42,843][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:00:43,566][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:00:44,288][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:00:45,012][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:00:45,733][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:00:46,455][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:00:47,179][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:00:47,901][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:00:48,621][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:00:49,344][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:00:50,068][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:00:50,790][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:00:51,512][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:00:52,234][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:00:52,958][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:00:53,678][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:00:54,402][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:00:55,125][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:00:55,849][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:00:56,571][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:00:57,293][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:00:58,015][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:00:58,737][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:00:59,460][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:01:00,183][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:01:00,906][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:01:01,630][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:01:02,354][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:01:03,078][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:01:03,800][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:01:04,523][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:01:05,246][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:01:05,969][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:01:06,693][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:01:07,416][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:01:08,484][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:01:09,208][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:01:09,933][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:01:10,657][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:01:11,382][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:01:12,106][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:01:12,830][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:01:13,555][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:01:14,281][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:01:15,007][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:01:15,733][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:01:16,459][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:01:17,184][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:01:17,910][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:01:18,636][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:01:19,362][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:01:20,089][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:01:20,834][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 17:01:21,907][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:01:21,911][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:01:21,914][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:01:23,243][__main__][INFO] - Iteration 163 took 56s (9.17% Gen, 88.46% Train). Generation: 5s, Training: 49s. Estimated remaining time: 12h 59m 41s. Estimated total time: 15h 36m 30s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 39s, 500 more iterations: 7h 48m 15s. [2026-03-25 17:01:23,247][__main__][INFO] - Starting iteration 163. [2026-03-25 17:01:23,251][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:01:23,251][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:01:32,518][__main__][INFO] - Number of regex retries in iteration 163: 0 [2026-03-25 17:01:32,519][__main__][INFO] - agents played in iteration 163 are Bob, Alice [2026-03-25 17:01:33,002][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:01:33,067][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:01:33,068][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:01:33,069][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:01:33,766][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:01:34,415][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:01:35,136][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:01:35,856][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:01:36,576][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:01:37,295][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:01:38,015][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:01:38,736][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:01:39,457][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:01:40,176][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:01:40,896][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:01:41,614][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:01:42,332][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:01:43,052][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:01:43,770][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:01:44,489][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:01:45,209][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:01:45,928][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:01:46,649][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:01:47,369][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:01:48,088][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:01:48,808][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:01:49,528][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:01:50,247][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:01:50,968][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:01:51,686][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:01:52,407][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:01:53,127][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:01:53,847][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:01:54,568][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:01:55,288][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:01:56,008][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:01:56,728][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:01:57,453][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:01:58,174][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:01:58,895][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:01:59,618][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:02:00,339][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:02:01,063][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:02:01,784][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:02:02,508][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:02:03,230][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:02:03,951][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:02:04,676][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:02:05,398][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:02:06,122][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:02:06,842][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:02:07,566][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:02:08,519][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:02:09,242][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:02:09,963][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:02:10,683][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:02:11,406][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:02:12,128][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:02:12,851][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:02:13,572][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:02:14,293][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:02:15,016][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:02:15,740][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:02:16,463][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:02:17,187][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:02:17,907][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:02:18,631][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:02:19,353][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:02:20,075][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:02:20,796][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 17:02:21,918][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:02:21,922][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:02:21,924][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:02:23,360][__main__][INFO] - Iteration 164 took 1m 0s (15.42% Gen, 82.19% Train). Generation: 9s, Training: 49s. Estimated remaining time: 14h 4m 1s. Estimated total time: 16h 41m 50s. Time estimates for 10 more iterations: 10m 1s, 100 more iterations: 1h 40m 11s, 500 more iterations: 8h 20m 55s. [2026-03-25 17:02:23,364][__main__][INFO] - Starting iteration 164. [2026-03-25 17:02:23,369][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:02:23,370][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:02:28,477][__main__][INFO] - Number of regex retries in iteration 164: 0 [2026-03-25 17:02:28,478][__main__][INFO] - agents played in iteration 164 are Bob, Alice [2026-03-25 17:02:28,968][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:02:29,033][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:02:29,034][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:02:29,035][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:02:29,717][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:02:30,366][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:02:31,090][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:02:31,810][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:02:32,528][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:02:33,248][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:02:33,966][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:02:34,685][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:02:35,406][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:02:36,125][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:02:36,845][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:02:37,564][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:02:38,283][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:02:39,004][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:02:39,725][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:02:40,444][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:02:41,164][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:02:41,885][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:02:42,605][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:02:43,326][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:02:44,047][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:02:44,769][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:02:45,489][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:02:46,208][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:02:46,930][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:02:47,651][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:02:48,371][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:02:49,092][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:02:49,814][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:02:50,532][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:02:51,252][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:02:51,974][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:02:52,693][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:02:53,416][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:02:54,136][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:02:54,858][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:02:55,578][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:02:56,299][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:02:57,020][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:02:57,742][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:02:58,463][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:02:59,186][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:02:59,907][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:03:00,627][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:03:01,351][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:03:02,073][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:03:02,794][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:03:03,514][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:03:04,461][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:03:05,183][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:03:05,903][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:03:06,625][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:03:07,347][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:03:08,067][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:03:08,790][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:03:09,511][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:03:10,233][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:03:10,953][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:03:11,676][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:03:12,397][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:03:13,118][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:03:13,842][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:03:14,564][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:03:15,284][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:03:16,006][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:03:16,743][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 17:03:17,788][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:03:17,792][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:03:17,794][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:03:19,225][__main__][INFO] - Iteration 165 took 55s (9.14% Gen, 88.29% Train). Generation: 5s, Training: 49s. Estimated remaining time: 12h 52m 12s. Estimated total time: 15h 30m 57s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 5s, 500 more iterations: 7h 45m 28s. [2026-03-25 17:03:19,228][__main__][INFO] - Starting iteration 165. [2026-03-25 17:03:19,232][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:03:19,233][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:03:24,390][__main__][INFO] - Number of regex retries in iteration 165: 0 [2026-03-25 17:03:24,391][__main__][INFO] - agents played in iteration 165 are Bob, Alice [2026-03-25 17:03:24,880][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:03:24,946][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:03:24,947][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:03:24,949][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:03:25,657][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:03:26,307][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:03:27,030][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:03:27,749][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:03:28,469][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:03:29,190][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:03:29,909][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:03:30,628][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:03:31,347][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:03:32,066][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:03:32,785][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:03:33,504][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:03:34,223][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:03:34,945][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:03:35,663][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:03:36,381][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:03:37,102][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:03:37,819][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:03:38,540][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:03:39,260][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:03:39,979][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:03:40,698][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:03:41,418][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:03:42,137][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:03:42,858][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:03:43,581][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:03:44,303][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:03:45,024][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:03:45,747][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:03:46,469][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:03:47,190][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:03:47,910][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:03:48,631][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:03:49,351][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:03:50,071][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:03:50,789][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:03:51,510][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:03:52,231][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:03:52,950][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:03:53,671][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:03:54,393][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:03:55,113][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:03:55,835][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:03:56,555][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:03:57,278][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:03:58,000][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:03:58,720][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:03:59,441][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:04:00,504][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:04:01,225][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:04:01,945][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:04:02,665][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:04:03,385][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:04:04,106][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:04:04,825][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:04:05,545][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:04:06,267][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:04:06,986][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:04:07,706][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:04:08,427][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:04:09,148][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:04:09,868][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:04:10,589][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:04:11,309][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:04:12,029][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:04:12,769][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 17:04:13,754][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:04:13,757][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:04:13,758][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:04:15,135][__main__][INFO] - Iteration 166 took 55s (9.23% Gen, 88.31% Train). Generation: 5s, Training: 49s. Estimated remaining time: 12h 52m 4s. Estimated total time: 15h 31m 45s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 10s, 500 more iterations: 7h 45m 52s. [2026-03-25 17:04:15,138][__main__][INFO] - Starting iteration 166. [2026-03-25 17:04:15,145][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:04:15,146][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:04:20,125][__main__][INFO] - Number of regex retries in iteration 166: 0 [2026-03-25 17:04:20,126][__main__][INFO] - agents played in iteration 166 are Bob, Alice [2026-03-25 17:04:20,611][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:04:20,674][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:04:20,675][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:04:20,675][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:04:21,398][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:04:22,049][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:04:22,770][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:04:23,488][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:04:24,209][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:04:24,932][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:04:25,651][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:04:26,369][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:04:27,090][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:04:27,810][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:04:28,529][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:04:29,252][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:04:29,972][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:04:30,692][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:04:31,411][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:04:32,132][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:04:32,852][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:04:33,572][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:04:34,292][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:04:35,012][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:04:35,733][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:04:36,452][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:04:37,173][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:04:37,894][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:04:38,613][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:04:39,332][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:04:40,051][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:04:40,769][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:04:41,489][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:04:42,208][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:04:42,926][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:04:43,646][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:04:44,363][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:04:45,082][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:04:45,802][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:04:46,519][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:04:47,239][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:04:47,960][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:04:48,680][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:04:49,398][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:04:50,116][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:04:50,835][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:04:51,556][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:04:52,276][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:04:52,994][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:04:53,714][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:04:54,433][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:04:55,152][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:04:56,101][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:04:56,822][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:04:57,542][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:04:58,261][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:04:58,981][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:04:59,700][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:05:00,418][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:05:01,138][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:05:01,858][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:05:02,577][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:05:03,297][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:05:04,016][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:05:04,736][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:05:05,457][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:05:06,178][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:05:06,896][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:05:07,617][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:05:08,342][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 17:05:09,298][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:05:09,301][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:05:09,303][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:05:10,652][__main__][INFO] - Iteration 167 took 55s (8.97% Gen, 88.59% Train). Generation: 4s, Training: 49s. Estimated remaining time: 12h 44m 32s. Estimated total time: 15h 25m 9s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 30s, 500 more iterations: 7h 42m 34s. [2026-03-25 17:05:10,655][__main__][INFO] - Starting iteration 167. [2026-03-25 17:05:10,659][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:05:10,659][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:05:15,708][__main__][INFO] - Number of regex retries in iteration 167: 0 [2026-03-25 17:05:15,709][__main__][INFO] - agents played in iteration 167 are Bob, Alice [2026-03-25 17:05:16,230][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:05:16,294][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:05:16,295][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:05:16,295][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:05:16,980][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:05:17,629][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:05:18,348][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:05:19,069][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:05:19,787][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:05:20,506][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:05:21,226][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:05:21,944][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:05:22,663][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:05:23,380][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:05:24,101][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:05:24,819][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:05:25,539][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:05:26,258][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:05:26,976][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:05:27,696][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:05:28,415][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:05:29,135][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:05:29,854][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:05:30,573][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:05:31,292][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:05:32,013][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:05:32,733][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:05:33,452][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:05:34,172][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:05:34,891][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:05:35,610][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:05:36,330][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:05:37,050][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:05:37,769][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:05:38,490][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:05:39,211][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:05:39,931][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:05:40,651][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:05:41,370][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:05:42,090][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:05:42,810][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:05:43,531][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:05:44,250][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:05:44,970][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:05:45,690][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:05:46,410][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:05:47,127][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:05:47,847][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:05:48,565][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:05:49,284][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:05:50,005][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:05:50,724][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:05:51,672][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:05:52,392][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:05:53,111][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:05:53,830][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:05:54,548][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:05:55,269][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:05:55,987][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:05:56,707][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:05:57,427][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:05:58,146][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:05:58,866][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:05:59,586][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:06:00,305][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:06:01,025][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:06:01,743][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:06:02,462][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:06:03,185][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:06:03,912][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 17:06:05,051][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:06:05,055][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:06:05,056][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:06:06,328][__main__][INFO] - Iteration 168 took 55s (9.07% Gen, 88.64% Train). Generation: 5s, Training: 49s. Estimated remaining time: 12h 46m 19s. Estimated total time: 15h 27m 51s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 47s, 500 more iterations: 7h 43m 55s. [2026-03-25 17:06:06,332][__main__][INFO] - Starting iteration 168. [2026-03-25 17:06:06,337][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:06:06,339][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:06:11,488][__main__][INFO] - Number of regex retries in iteration 168: 0 [2026-03-25 17:06:11,489][__main__][INFO] - agents played in iteration 168 are Bob, Alice [2026-03-25 17:06:12,063][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:06:12,128][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:06:12,129][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:06:12,130][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:06:12,827][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:06:13,475][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:06:14,196][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:06:14,912][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:06:15,631][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:06:16,347][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:06:17,067][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:06:17,783][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:06:18,503][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:06:19,220][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:06:19,938][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:06:20,657][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:06:21,375][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:06:22,096][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:06:22,813][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:06:23,532][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:06:24,250][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:06:24,968][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:06:25,687][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:06:26,404][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:06:27,125][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:06:27,844][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:06:28,563][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:06:29,281][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:06:30,000][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:06:30,718][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:06:31,438][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:06:32,156][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:06:32,875][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:06:33,594][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:06:34,315][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:06:35,038][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:06:35,762][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:06:36,485][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:06:37,205][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:06:37,928][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:06:38,652][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:06:43,480][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:06:44,200][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:06:44,919][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:06:45,638][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:06:46,359][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:06:47,079][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:06:47,798][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:06:48,519][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:06:49,239][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:06:49,959][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:06:50,680][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:06:51,736][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:06:52,456][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:06:53,177][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:06:53,898][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:06:54,618][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:06:55,338][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:06:56,059][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:06:56,780][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:06:57,501][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:06:58,222][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:06:58,944][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:06:59,666][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:07:00,389][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:07:01,109][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:07:01,831][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:07:02,552][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:07:03,273][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:07:04,047][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:51 [2026-03-25 17:07:05,158][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:07:05,161][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:07:05,165][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:07:06,572][__main__][INFO] - Iteration 169 took 1m 0s (8.55% Gen, 89.11% Train). Generation: 5s, Training: 53s. Estimated remaining time: 14h 1m 25s. Estimated total time: 16h 43m 57s. Time estimates for 10 more iterations: 10m 2s, 100 more iterations: 1h 40m 23s, 500 more iterations: 8h 21m 58s. [2026-03-25 17:07:06,575][__main__][INFO] - Starting iteration 169. [2026-03-25 17:07:06,579][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:07:06,580][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:07:11,789][__main__][INFO] - Number of regex retries in iteration 169: 0 [2026-03-25 17:07:11,790][__main__][INFO] - agents played in iteration 169 are Bob, Alice [2026-03-25 17:07:12,275][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:07:12,342][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:07:12,343][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:07:12,344][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:07:13,020][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:07:13,668][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:07:14,385][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:07:15,102][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:07:15,818][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:07:16,533][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:07:17,250][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:07:17,966][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:07:18,684][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:07:19,400][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:07:20,119][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:07:20,835][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:07:21,552][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:07:22,269][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:07:22,986][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:07:23,704][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:07:24,424][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:07:25,141][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:07:25,860][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:07:26,575][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:07:27,293][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:07:28,010][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:07:28,727][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:07:29,445][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:07:30,163][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:07:30,882][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:07:31,598][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:07:32,317][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:07:33,035][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:07:33,753][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:07:34,470][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:07:35,189][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:07:35,906][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:07:36,625][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:07:37,342][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:07:38,060][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:07:38,779][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:07:39,498][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:07:40,217][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:07:40,935][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:07:41,653][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:07:42,371][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:07:43,092][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:07:43,812][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:07:44,530][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:07:45,250][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:07:45,967][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:07:46,686][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:07:47,634][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:07:48,354][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:07:49,073][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:07:49,792][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:07:50,511][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:07:51,229][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:07:51,948][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:07:52,667][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:07:53,384][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:07:54,103][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:07:54,824][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:07:55,543][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:07:56,262][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:07:56,983][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:07:57,701][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:07:58,420][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:07:59,139][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:07:59,877][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 17:08:04,084][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:08:04,088][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:08:04,090][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:08:05,589][__main__][INFO] - Iteration 170 took 59s (8.83% Gen, 88.63% Train). Generation: 5s, Training: 52s. Estimated remaining time: 13h 40m 1s. Estimated total time: 16h 23m 32s. Time estimates for 10 more iterations: 9m 50s, 100 more iterations: 1h 38m 21s, 500 more iterations: 8h 11m 46s. [2026-03-25 17:08:05,592][__main__][INFO] - Starting iteration 170. [2026-03-25 17:08:05,596][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:08:05,596][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:08:10,753][__main__][INFO] - Number of regex retries in iteration 170: 0 [2026-03-25 17:08:10,754][__main__][INFO] - agents played in iteration 170 are Bob, Alice [2026-03-25 17:08:11,239][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:08:11,304][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:08:11,305][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:08:11,305][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:08:12,001][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:08:12,647][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:08:13,364][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:08:14,080][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:08:14,793][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:08:15,511][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:08:16,226][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:08:16,943][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:08:17,660][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:08:18,377][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:08:19,093][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:08:19,809][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:08:20,525][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:08:21,241][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:08:21,957][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:08:22,675][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:08:23,393][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:08:24,108][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:08:24,826][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:08:25,542][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:08:26,260][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:08:26,975][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:08:27,694][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:08:28,411][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:08:29,128][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:08:29,844][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:08:30,563][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:08:31,278][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:08:31,998][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:08:32,717][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:08:33,435][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:08:34,152][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:08:34,869][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:08:35,589][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:08:36,305][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:08:37,024][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:08:37,741][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:08:38,460][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:08:39,178][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:08:39,895][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:08:40,615][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:08:41,333][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:08:42,051][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:08:42,769][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:08:43,489][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:08:44,206][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:08:44,926][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:08:45,646][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:08:46,588][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:08:47,308][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:08:48,026][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:08:48,745][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:08:49,463][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:08:50,182][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:08:50,900][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:08:51,619][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:08:52,338][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:08:53,055][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:08:53,774][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:08:54,493][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:08:55,211][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:08:55,931][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:08:56,650][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:08:57,369][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:08:58,088][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:08:58,815][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 17:08:59,797][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:08:59,800][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:08:59,802][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:09:01,236][__main__][INFO] - Iteration 171 took 55s (9.27% Gen, 88.15% Train). Generation: 5s, Training: 49s. Estimated remaining time: 12h 42m 55s. Estimated total time: 15h 27m 22s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 44s, 500 more iterations: 7h 43m 41s. [2026-03-25 17:09:01,239][__main__][INFO] - Starting iteration 171. [2026-03-25 17:09:01,243][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:09:01,244][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:09:06,479][__main__][INFO] - Number of regex retries in iteration 171: 0 [2026-03-25 17:09:06,480][__main__][INFO] - agents played in iteration 171 are Bob, Alice [2026-03-25 17:09:06,961][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:09:07,026][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:09:07,027][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:09:07,027][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:09:07,703][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:09:08,352][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:09:09,070][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:09:09,787][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:09:10,502][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:09:11,221][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:09:11,935][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:09:12,653][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:09:13,369][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:09:14,090][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:09:14,810][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:09:15,527][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:09:16,247][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:09:16,965][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:09:17,685][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:09:18,405][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:09:19,123][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:09:19,842][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:09:20,561][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:09:21,280][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:09:21,999][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:09:22,718][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:09:23,436][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:09:24,154][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:09:24,873][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:09:25,591][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:09:26,309][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:09:27,027][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:09:27,746][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:09:28,464][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:09:29,181][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:09:29,899][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:09:30,617][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:09:31,335][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:09:32,054][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:09:32,771][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:09:33,490][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:09:34,208][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:09:34,927][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:09:35,644][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:09:36,363][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:09:37,083][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:09:37,801][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:09:38,520][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:09:39,238][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:09:39,958][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:09:40,679][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:09:41,396][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:09:42,426][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:09:43,146][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:09:43,864][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:09:44,583][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:09:45,304][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:09:46,022][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:09:46,741][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:09:47,460][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:09:48,178][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:09:48,898][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:09:49,618][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:09:50,337][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:09:51,056][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:09:51,776][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:09:52,496][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:09:53,216][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:09:53,937][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:09:54,679][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 17:09:55,672][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:09:55,676][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:09:55,677][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:09:56,952][__main__][INFO] - Iteration 172 took 55s (9.40% Gen, 88.31% Train). Generation: 5s, Training: 49s. Estimated remaining time: 12h 43m 8s. Estimated total time: 15h 28m 30s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 51s, 500 more iterations: 7h 44m 15s. [2026-03-25 17:09:56,955][__main__][INFO] - Starting iteration 172. [2026-03-25 17:09:56,958][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:09:56,959][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:10:02,045][__main__][INFO] - Number of regex retries in iteration 172: 0 [2026-03-25 17:10:02,047][__main__][INFO] - agents played in iteration 172 are Bob, Alice [2026-03-25 17:10:02,527][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:10:02,593][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:10:02,593][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:10:02,594][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:10:03,278][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:10:03,925][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:10:04,645][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:10:05,361][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:10:06,078][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:10:06,796][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:10:07,513][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:10:08,231][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:10:08,951][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:10:09,669][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:10:10,387][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:10:11,103][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:10:11,819][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:10:12,537][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:10:13,254][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:10:13,973][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:10:14,690][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:10:15,410][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:10:16,127][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:10:16,845][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:10:17,564][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:10:18,280][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:10:18,999][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:10:19,716][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:10:20,435][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:10:21,154][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:10:21,871][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:10:22,590][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:10:23,308][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:10:24,026][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:10:24,745][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:10:25,462][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:10:26,181][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:10:26,899][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:10:27,616][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:10:28,336][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:10:29,054][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:10:29,773][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:10:30,492][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:10:31,212][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:10:31,933][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:10:32,652][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:10:33,370][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:10:34,089][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:10:34,809][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:10:35,527][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:10:36,247][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:10:36,967][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:10:37,912][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:10:38,632][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:10:39,351][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:10:40,071][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:10:40,790][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:10:41,509][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:10:42,228][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:10:42,946][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:10:43,667][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:10:44,385][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:10:45,104][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:10:45,825][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:10:46,544][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:10:47,264][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:10:47,984][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:10:48,703][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:10:49,424][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:10:50,142][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 17:10:51,148][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:10:51,151][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:10:51,152][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:10:52,456][__main__][INFO] - Iteration 173 took 55s (9.17% Gen, 88.48% Train). Generation: 5s, Training: 49s. Estimated remaining time: 12h 38m 41s. Estimated total time: 15h 24m 59s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 29s, 500 more iterations: 7h 42m 29s. [2026-03-25 17:10:52,459][__main__][INFO] - Starting iteration 173. [2026-03-25 17:10:52,463][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:10:52,463][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:10:57,530][__main__][INFO] - Number of regex retries in iteration 173: 0 [2026-03-25 17:10:57,531][__main__][INFO] - agents played in iteration 173 are Bob, Alice [2026-03-25 17:10:58,012][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:10:58,076][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:10:58,077][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:10:58,078][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:10:58,763][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:10:59,410][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:11:00,130][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:11:00,848][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:11:01,563][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:11:02,281][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:11:02,999][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:11:03,718][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:11:04,435][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:11:05,152][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:11:05,869][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:11:06,587][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:11:07,304][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:11:08,023][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:11:08,741][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:11:09,460][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:11:10,179][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:11:10,897][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:11:11,615][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:11:12,332][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:11:13,050][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:11:13,768][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:11:14,485][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:11:15,204][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:11:15,921][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:11:16,639][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:11:17,357][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:11:18,074][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:11:18,794][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:11:19,511][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:11:20,229][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:11:20,949][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:11:21,667][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:11:22,385][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:11:23,104][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:11:23,823][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:11:24,542][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:11:25,260][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:11:25,979][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:11:26,698][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:11:27,416][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:11:28,135][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:11:28,854][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:11:29,573][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:11:30,292][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:11:31,012][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:11:31,730][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:11:32,448][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:11:33,390][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:11:34,109][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:11:34,827][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:11:35,548][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:11:36,266][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:11:36,985][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:11:37,705][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:11:38,423][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:11:39,143][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:11:39,863][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:11:40,582][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:11:41,300][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:11:42,020][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:11:42,738][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:11:43,457][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:11:44,177][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:11:44,896][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:11:45,621][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 17:11:46,549][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:11:46,551][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:11:46,553][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:11:47,948][__main__][INFO] - Iteration 174 took 55s (9.13% Gen, 88.35% Train). Generation: 5s, Training: 49s. Estimated remaining time: 12h 37m 33s. Estimated total time: 15h 24m 47s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 28s, 500 more iterations: 7h 42m 23s. [2026-03-25 17:11:47,952][__main__][INFO] - Starting iteration 174. [2026-03-25 17:11:47,958][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:11:47,960][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:11:53,077][__main__][INFO] - Number of regex retries in iteration 174: 0 [2026-03-25 17:11:53,078][__main__][INFO] - agents played in iteration 174 are Bob, Alice [2026-03-25 17:11:53,564][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:11:53,627][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:11:53,628][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:11:53,629][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:11:54,320][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:11:54,975][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:11:55,694][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:11:56,413][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:11:57,129][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:11:57,847][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:11:58,565][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:11:59,282][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:11:59,998][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:12:00,718][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:12:01,434][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:12:02,151][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:12:02,869][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:12:03,588][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:12:04,304][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:12:05,023][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:12:05,739][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:12:06,458][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:12:07,175][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:12:07,895][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:12:08,614][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:12:09,333][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:12:10,052][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:12:10,771][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:12:11,489][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:12:12,208][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:12:12,926][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:12:13,646][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:12:14,363][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:12:15,081][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:12:15,799][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:12:16,518][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:12:17,237][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:12:17,955][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:12:18,674][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:12:19,392][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:12:20,109][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:12:20,830][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:12:21,548][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:12:22,266][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:12:22,986][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:12:23,705][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:12:24,423][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:12:25,143][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:12:25,860][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:12:26,579][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:12:27,298][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:12:28,017][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:12:29,041][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:12:29,762][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:12:30,479][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:12:31,199][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:12:31,920][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:12:32,637][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:12:33,357][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:12:34,076][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:12:34,794][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:12:35,514][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:12:36,233][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:12:36,952][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:12:37,672][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:12:38,391][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:12:39,110][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:12:39,831][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:12:40,551][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:12:41,311][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 17:12:42,438][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:12:42,442][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:12:42,444][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:12:43,809][__main__][INFO] - Iteration 175 took 55s (9.16% Gen, 88.38% Train). Generation: 5s, Training: 49s. Estimated remaining time: 12h 42m 44s. Estimated total time: 15h 30m 53s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 5s, 500 more iterations: 7h 45m 26s. [2026-03-25 17:12:43,811][__main__][INFO] - Starting iteration 175. [2026-03-25 17:12:43,816][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:12:43,817][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:12:48,884][__main__][INFO] - Number of regex retries in iteration 175: 0 [2026-03-25 17:12:48,885][__main__][INFO] - agents played in iteration 175 are Bob, Alice [2026-03-25 17:12:49,455][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:12:49,520][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:12:49,521][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:12:49,522][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:12:50,206][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:12:50,852][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:12:51,571][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:12:52,287][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:12:53,005][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:12:53,723][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:12:54,438][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:12:55,159][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:12:55,876][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:12:56,594][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:12:57,308][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:12:58,027][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:12:58,745][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:12:59,461][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:13:00,178][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:13:00,894][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:13:01,612][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:13:02,327][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:13:03,045][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:13:03,762][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:13:04,480][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:13:05,196][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:13:05,914][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:13:06,631][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:13:07,351][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:13:08,068][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:13:08,789][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:13:09,506][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:13:10,226][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:13:10,945][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:13:11,664][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:13:12,385][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:13:13,102][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:13:13,824][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:13:14,541][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:13:15,262][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:13:15,983][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:13:16,703][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:13:17,423][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:13:18,144][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:13:18,867][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:13:19,588][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:13:20,312][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:13:21,034][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:13:21,753][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:13:22,476][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:13:23,197][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:13:23,921][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:13:24,888][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:13:25,611][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:13:26,331][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:13:27,051][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:13:27,772][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:13:28,495][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:13:29,215][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:13:29,936][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:13:30,655][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:13:31,372][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:13:32,091][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:13:32,810][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:13:33,529][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:13:34,249][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:13:34,968][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:13:35,687][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:13:36,406][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:13:37,135][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 17:13:38,174][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:13:38,179][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:13:38,181][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:13:39,586][__main__][INFO] - Iteration 176 took 55s (9.09% Gen, 88.39% Train). Generation: 5s, Training: 49s. Estimated remaining time: 12h 40m 27s. Estimated total time: 15h 29m 32s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 57s, 500 more iterations: 7h 44m 46s. [2026-03-25 17:13:39,589][__main__][INFO] - Starting iteration 176. [2026-03-25 17:13:39,593][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:13:39,594][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:13:44,581][__main__][INFO] - Number of regex retries in iteration 176: 0 [2026-03-25 17:13:44,582][__main__][INFO] - agents played in iteration 176 are Bob, Alice [2026-03-25 17:13:45,112][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:13:45,176][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:13:45,177][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:13:45,177][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:13:45,859][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:13:46,505][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:13:47,222][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:13:47,941][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:13:48,656][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:13:49,373][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:13:50,088][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:13:50,806][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:13:51,522][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:13:52,240][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:13:52,957][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:13:53,673][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:13:54,392][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:13:55,107][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:13:55,826][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:13:56,542][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:13:57,261][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:13:57,977][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:13:58,695][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:13:59,412][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:14:00,129][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:14:00,846][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:14:01,563][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:14:02,282][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:14:02,998][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:14:03,718][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:14:04,433][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:14:05,152][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:14:05,870][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:14:06,586][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:14:07,306][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:14:08,023][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:14:08,743][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:14:09,460][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:14:10,182][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:14:10,903][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:14:11,622][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:14:12,344][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:14:13,066][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:14:13,787][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:14:14,506][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:14:15,230][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:14:15,952][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:14:16,672][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:14:17,389][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:14:18,109][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:14:18,828][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:14:19,547][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:14:20,498][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:14:21,216][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:14:21,935][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:14:22,652][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:14:23,373][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:14:24,092][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:14:24,811][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:14:25,530][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:14:26,248][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:14:26,966][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:14:27,686][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:14:28,403][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:14:29,123][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:14:29,842][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:14:30,560][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:14:31,279][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:14:31,997][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:14:32,718][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 17:14:33,760][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:14:33,763][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:14:33,765][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:14:35,112][__main__][INFO] - Iteration 177 took 55s (8.99% Gen, 88.58% Train). Generation: 4s, Training: 49s. Estimated remaining time: 12h 35m 20s. Estimated total time: 15h 25m 21s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 32s, 500 more iterations: 7h 42m 40s. [2026-03-25 17:14:35,116][__main__][INFO] - Starting iteration 177. [2026-03-25 17:14:35,120][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:14:35,121][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:14:40,144][__main__][INFO] - Number of regex retries in iteration 177: 0 [2026-03-25 17:14:40,146][__main__][INFO] - agents played in iteration 177 are Bob, Alice [2026-03-25 17:14:40,651][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:14:40,716][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:14:40,717][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:14:40,718][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:14:41,442][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:14:42,090][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:14:42,813][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:14:43,535][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:14:44,255][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:14:44,975][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:14:45,697][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:14:46,417][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:14:47,137][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:14:47,854][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:14:48,574][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:14:49,292][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:14:50,009][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:14:50,727][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:14:51,445][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:14:52,165][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:14:52,882][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:14:53,602][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:14:54,322][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:14:55,039][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:14:55,759][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:14:56,480][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:14:57,198][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:14:57,920][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:14:58,638][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:14:59,356][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:15:00,073][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:15:00,791][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:15:01,508][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:15:02,227][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:15:02,944][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:15:03,664][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:15:04,382][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:15:05,100][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:15:05,818][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:15:06,536][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:15:07,254][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:15:07,973][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:15:08,691][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:15:09,410][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:15:10,129][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:15:10,847][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:15:11,567][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:15:12,284][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:15:13,001][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:15:13,719][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:15:14,437][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:15:15,156][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:15:16,181][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:15:16,899][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:15:17,617][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:15:18,335][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:15:19,054][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:15:19,772][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:15:20,492][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:15:21,209][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:15:21,927][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:15:22,646][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:15:23,364][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:15:24,084][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:15:24,802][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:15:25,521][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:15:26,238][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:15:26,958][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:15:27,675][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:15:28,402][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 17:15:29,355][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:15:29,357][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:15:29,359][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:15:30,857][__main__][INFO] - Iteration 178 took 55s (9.01% Gen, 88.29% Train). Generation: 5s, Training: 49s. Estimated remaining time: 12h 38m 1s. Estimated total time: 15h 28m 58s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 53s, 500 more iterations: 7h 44m 29s. [2026-03-25 17:15:30,860][__main__][INFO] - Starting iteration 178. [2026-03-25 17:15:30,864][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:15:30,865][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:15:35,887][__main__][INFO] - Number of regex retries in iteration 178: 0 [2026-03-25 17:15:35,888][__main__][INFO] - agents played in iteration 178 are Bob, Alice [2026-03-25 17:15:36,368][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:15:36,433][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:15:36,434][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:15:36,434][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:15:37,114][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:15:37,761][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:15:38,481][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:15:39,197][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:15:39,912][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:15:40,628][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:15:41,347][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:15:42,063][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:15:42,780][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:15:43,497][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:15:44,213][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:15:44,930][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:15:45,646][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:15:46,364][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:15:47,080][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:15:47,799][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:15:48,515][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:15:49,233][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:15:49,950][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:15:50,668][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:15:51,385][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:15:52,102][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:15:52,820][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:15:53,536][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:15:54,254][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:15:54,972][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:15:55,688][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:15:56,405][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:15:57,121][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:15:57,840][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:15:58,555][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:15:59,274][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:15:59,990][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:16:00,708][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:16:01,425][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:16:02,140][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:16:02,859][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:16:03,577][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:16:04,296][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:16:05,014][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:16:05,732][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:16:06,450][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:16:07,167][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:16:07,887][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:16:08,605][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:16:09,324][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:16:10,043][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:16:10,761][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:16:11,708][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:16:12,427][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:16:13,144][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:16:13,861][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:16:14,580][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:16:15,298][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:16:16,017][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:16:16,735][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:16:17,453][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:16:18,174][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:16:18,891][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:16:19,610][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:16:20,328][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:16:21,046][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:16:21,764][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:16:22,482][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:16:23,202][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:16:23,925][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 17:16:25,061][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:16:25,065][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:16:25,068][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:16:26,565][__main__][INFO] - Iteration 179 took 55s (9.02% Gen, 88.29% Train). Generation: 5s, Training: 49s. Estimated remaining time: 12h 36m 30s. Estimated total time: 15h 28m 23s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 50s, 500 more iterations: 7h 44m 11s. [2026-03-25 17:16:26,568][__main__][INFO] - Starting iteration 179. [2026-03-25 17:16:26,573][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:16:26,574][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:16:31,678][__main__][INFO] - Number of regex retries in iteration 179: 0 [2026-03-25 17:16:31,679][__main__][INFO] - agents played in iteration 179 are Bob, Alice [2026-03-25 17:16:32,155][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:16:32,220][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:16:32,221][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:16:32,222][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:16:32,902][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:16:33,548][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:16:34,265][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:16:34,981][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:16:35,697][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:16:36,413][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:16:37,130][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:16:37,846][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:16:38,565][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:16:39,283][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:16:40,001][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:16:40,715][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:16:41,433][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:16:42,149][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:16:42,866][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:16:43,582][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:16:44,299][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:16:45,014][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:16:45,732][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:16:46,449][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:16:47,167][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:16:47,882][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:16:48,601][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:16:49,316][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:16:50,036][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:16:50,752][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:16:51,470][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:16:52,188][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:16:52,903][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:16:53,622][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:16:54,338][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:16:55,057][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:16:55,774][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:16:56,493][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:16:57,210][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:16:57,928][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:16:58,646][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:16:59,363][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:17:00,082][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:17:00,800][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:17:01,517][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:17:02,235][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:17:02,953][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:17:03,671][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:17:04,389][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:17:05,107][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:17:05,826][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:17:06,542][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:17:07,501][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:17:08,220][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:17:08,938][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:17:09,657][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:17:10,374][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:17:11,092][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:17:11,810][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:17:12,528][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:17:13,246][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:17:13,963][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:17:14,683][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:17:15,400][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:17:16,118][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:17:16,838][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:17:17,555][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:17:18,274][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:17:18,993][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:17:19,769][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 17:17:20,686][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:17:20,688][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:17:20,689][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:17:21,945][__main__][INFO] - Iteration 180 took 55s (9.22% Gen, 88.51% Train). Generation: 5s, Training: 49s. Estimated remaining time: 12h 30m 7s. Estimated total time: 15h 22m 54s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 17s, 500 more iterations: 7h 41m 27s. [2026-03-25 17:17:21,948][__main__][INFO] - Starting iteration 180. [2026-03-25 17:17:21,952][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:17:21,953][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:17:27,175][__main__][INFO] - Number of regex retries in iteration 180: 0 [2026-03-25 17:17:27,176][__main__][INFO] - agents played in iteration 180 are Bob, Alice [2026-03-25 17:17:27,666][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:17:27,732][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:17:27,733][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:17:27,734][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:17:28,416][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:17:29,065][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:17:29,781][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:17:30,495][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:17:31,212][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:17:31,927][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:17:32,646][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:17:33,361][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:17:34,079][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:17:34,794][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:17:35,513][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:17:36,230][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:17:36,948][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:17:37,665][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:17:38,383][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:17:39,099][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:17:39,816][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:17:40,533][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:17:41,250][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:17:41,968][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:17:42,683][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:17:43,402][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:17:44,118][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:17:44,835][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:17:45,551][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:17:46,271][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:17:46,987][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:17:47,706][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:17:48,422][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:17:49,141][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:17:49,857][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:17:50,576][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:17:51,294][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:17:52,010][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:17:52,729][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:17:53,445][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:17:54,163][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:17:54,881][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:17:55,600][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:17:56,319][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:17:57,036][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:17:57,755][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:17:58,472][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:17:59,191][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:17:59,910][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:18:00,628][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:18:01,346][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:18:02,065][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:18:03,042][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:18:03,762][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:18:04,479][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:18:05,197][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:18:05,916][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:18:06,632][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:18:07,351][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:18:08,069][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:18:08,789][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:18:09,508][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:18:10,226][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:18:10,944][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:18:11,666][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:18:12,385][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:18:13,104][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:18:13,826][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:18:14,544][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:18:15,294][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 17:18:16,316][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:18:16,319][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:18:16,321][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:18:17,658][__main__][INFO] - Iteration 181 took 55s (9.38% Gen, 88.22% Train). Generation: 5s, Training: 49s. Estimated remaining time: 12h 34m 44s. Estimated total time: 15h 28m 27s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 50s, 500 more iterations: 7h 44m 13s. [2026-03-25 17:18:17,661][__main__][INFO] - Starting iteration 181. [2026-03-25 17:18:17,665][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:18:17,666][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:18:22,975][__main__][INFO] - Number of regex retries in iteration 181: 0 [2026-03-25 17:18:22,976][__main__][INFO] - agents played in iteration 181 are Bob, Alice [2026-03-25 17:18:23,459][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:18:23,524][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:18:23,525][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:18:23,526][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:18:24,207][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:18:24,855][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:18:25,574][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:18:26,289][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:18:27,004][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:18:27,720][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:18:28,436][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:18:29,151][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:18:29,870][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:18:30,588][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:18:31,308][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:18:32,026][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:18:32,743][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:18:33,463][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:18:34,183][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:18:34,901][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:18:35,622][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:18:36,342][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:18:37,060][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:18:37,780][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:18:38,500][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:18:39,219][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:18:39,939][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:18:40,659][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:18:41,379][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:18:42,101][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:18:42,821][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:18:43,541][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:18:44,263][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:18:44,984][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:18:45,702][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:18:46,422][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:18:47,139][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:18:47,859][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:18:48,579][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:18:49,298][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:18:50,017][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:18:50,735][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:18:51,456][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:18:52,175][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:18:52,894][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:18:53,614][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:18:54,335][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:18:55,054][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:18:55,773][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:18:56,493][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:18:57,213][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:18:57,933][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:18:58,877][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:18:59,598][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:19:00,316][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:19:01,035][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:19:01,755][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:19:02,475][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:19:03,193][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:19:03,913][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:19:04,633][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:19:05,352][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:19:06,073][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:19:06,793][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:19:07,514][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:19:08,233][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:19:08,953][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:19:09,670][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:19:10,389][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:19:11,115][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 17:19:12,131][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:19:12,136][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:19:12,138][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:19:13,539][__main__][INFO] - Iteration 182 took 55s (9.50% Gen, 87.98% Train). Generation: 5s, Training: 49s. Estimated remaining time: 12h 36m 37s. Estimated total time: 15h 31m 16s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 7s, 500 more iterations: 7h 45m 38s. [2026-03-25 17:19:13,542][__main__][INFO] - Starting iteration 182. [2026-03-25 17:19:13,546][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:19:13,546][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:19:18,687][__main__][INFO] - Number of regex retries in iteration 182: 0 [2026-03-25 17:19:18,689][__main__][INFO] - agents played in iteration 182 are Bob, Alice [2026-03-25 17:19:19,255][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:19:19,320][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:19:19,321][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:19:19,322][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:19:19,999][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:19:20,644][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:19:21,363][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:19:22,078][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:19:22,794][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:19:23,511][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:19:24,226][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:19:24,942][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:19:25,658][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:19:26,373][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:19:27,091][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:19:27,807][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:19:28,524][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:19:29,239][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:19:29,956][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:19:30,671][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:19:31,386][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:19:32,104][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:19:32,820][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:19:33,537][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:19:34,252][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:19:34,969][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:19:35,685][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:19:36,404][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:19:37,121][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:19:37,839][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:19:38,554][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:19:39,272][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:19:39,989][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:19:40,705][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:19:41,423][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:19:42,141][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:19:42,859][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:19:43,574][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:19:44,292][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:19:45,008][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:19:45,726][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:19:46,443][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:19:47,162][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:19:47,878][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:19:48,597][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:19:49,313][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:19:50,032][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:19:50,750][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:19:51,467][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:19:52,186][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:19:52,903][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:19:53,622][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:19:54,580][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:19:55,299][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:19:56,014][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:19:56,734][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:19:57,451][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:19:58,169][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:19:58,887][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:19:59,606][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:20:00,324][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:20:01,043][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:20:01,760][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:20:02,481][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:20:03,198][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:20:03,918][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:20:04,636][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:20:05,353][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:20:06,074][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:20:06,861][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 17:20:07,812][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:20:07,815][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:20:07,816][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:20:09,145][__main__][INFO] - Iteration 183 took 55s (9.25% Gen, 88.36% Train). Generation: 5s, Training: 49s. Estimated remaining time: 12h 31m 5s. Estimated total time: 15h 26m 40s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 40s, 500 more iterations: 7h 43m 20s. [2026-03-25 17:20:09,147][__main__][INFO] - Starting iteration 183. [2026-03-25 17:20:09,151][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:20:09,152][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:20:14,134][__main__][INFO] - Number of regex retries in iteration 183: 0 [2026-03-25 17:20:14,136][__main__][INFO] - agents played in iteration 183 are Bob, Alice [2026-03-25 17:20:14,657][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:20:14,721][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:20:14,722][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:20:14,722][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:20:15,409][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:20:16,055][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:20:16,774][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:20:17,492][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:20:18,209][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:20:18,926][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:20:19,642][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:20:20,358][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:20:21,074][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:20:21,790][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:20:22,508][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:20:23,223][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:20:23,945][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:20:24,661][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:20:25,378][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:20:26,095][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:20:26,813][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:20:27,532][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:20:28,247][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:20:28,964][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:20:29,681][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:20:30,400][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:20:31,116][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:20:31,833][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:20:32,549][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:20:33,267][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:20:33,983][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:20:34,699][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:20:35,421][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:20:36,138][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:20:36,857][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:20:37,573][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:20:38,292][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:20:39,011][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:20:39,727][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:20:40,447][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:20:41,165][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:20:41,880][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:20:42,600][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:20:43,318][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:20:44,036][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:20:44,755][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:20:45,472][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:20:46,191][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:20:46,908][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:20:47,628][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:20:48,348][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:20:49,066][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:20:50,042][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:20:50,760][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:20:51,479][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:20:52,197][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:20:52,917][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:20:53,634][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:20:54,352][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:20:55,072][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:20:55,790][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:20:56,510][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:20:57,227][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:20:57,948][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:20:58,669][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:20:59,388][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:21:00,108][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:21:00,827][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:21:01,547][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:21:02,267][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 17:21:07,416][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:21:07,420][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:21:07,422][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:21:08,746][__main__][INFO] - Iteration 184 took 59s (8.36% Gen, 89.41% Train). Generation: 4s, Training: 53s. Estimated remaining time: 13h 36m 42s. Estimated total time: 16h 33m 16s. Time estimates for 10 more iterations: 9m 55s, 100 more iterations: 1h 39m 19s, 500 more iterations: 8h 16m 38s. [2026-03-25 17:21:08,749][__main__][INFO] - Starting iteration 184. [2026-03-25 17:21:08,754][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:21:08,755][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:21:13,827][__main__][INFO] - Number of regex retries in iteration 184: 0 [2026-03-25 17:21:13,828][__main__][INFO] - agents played in iteration 184 are Bob, Alice [2026-03-25 17:21:14,323][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:21:14,389][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:21:14,390][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:21:14,391][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:21:15,078][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:21:15,723][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:21:16,441][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:21:17,155][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:21:17,872][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:21:18,584][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:21:19,300][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:21:20,017][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:21:20,733][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:21:21,450][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:21:22,167][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:21:22,884][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:21:23,600][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:21:24,315][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:21:25,032][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:21:25,747][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:21:26,465][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:21:27,180][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:21:27,898][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:21:28,614][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:21:29,335][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:21:30,052][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:21:30,771][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:21:31,492][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:21:32,208][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:21:32,929][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:21:33,648][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:21:34,367][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:21:35,086][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:21:35,805][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:21:36,524][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:21:37,244][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:21:37,963][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:21:38,683][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:21:39,402][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:21:40,122][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:21:40,841][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:21:41,563][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:21:42,283][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:21:43,002][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:21:43,723][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:21:44,444][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:21:45,163][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:21:45,881][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:21:46,602][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:21:47,321][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:21:48,040][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:21:48,762][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:21:49,737][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:21:50,459][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:21:51,180][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:21:51,900][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:21:52,620][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:21:53,341][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:21:54,058][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:21:54,776][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:21:55,496][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:21:56,213][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:21:56,933][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:21:57,650][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:21:58,368][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:21:59,088][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:21:59,805][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:22:00,527][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:22:01,248][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:22:01,978][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 17:22:02,962][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:22:02,965][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:22:02,966][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:22:04,287][__main__][INFO] - Iteration 185 took 55s (9.14% Gen, 88.48% Train). Generation: 5s, Training: 49s. Estimated remaining time: 12h 28m 5s. Estimated total time: 15h 25m 35s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 33s, 500 more iterations: 7h 42m 47s. [2026-03-25 17:22:04,289][__main__][INFO] - Starting iteration 185. [2026-03-25 17:22:04,293][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:22:04,294][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:22:09,407][__main__][INFO] - Number of regex retries in iteration 185: 0 [2026-03-25 17:22:09,408][__main__][INFO] - agents played in iteration 185 are Bob, Alice [2026-03-25 17:22:09,915][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:22:09,983][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:22:09,984][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:22:10,045][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:22:10,780][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:22:11,431][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:22:12,151][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:22:12,870][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:22:13,587][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:22:14,307][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:22:15,028][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:22:15,746][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:22:16,468][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:22:17,187][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:22:17,907][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:22:18,623][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:22:19,338][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:22:20,056][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:22:20,776][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:22:21,495][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:22:22,213][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:22:22,934][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:22:23,653][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:22:24,373][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:22:25,095][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:22:25,814][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:22:26,533][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:22:27,252][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:22:27,970][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:22:28,691][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:22:29,410][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:22:30,128][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:22:30,850][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:22:31,569][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:22:32,289][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:22:33,009][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:22:33,730][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:22:34,451][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:22:35,171][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:22:35,892][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:22:36,614][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:22:37,335][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:22:38,056][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:22:38,775][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:22:39,495][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:22:40,212][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:22:40,931][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:22:41,649][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:22:42,367][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:22:43,087][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:22:43,803][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:22:44,523][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:22:45,466][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:22:46,185][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:22:46,903][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:22:47,623][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:22:48,342][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:22:49,060][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:22:49,779][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:22:50,496][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:22:51,214][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:22:51,932][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:22:52,650][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:22:53,370][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:22:54,091][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:22:54,812][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:22:55,535][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:22:56,258][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:22:56,978][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:22:57,776][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 17:22:58,747][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:22:58,749][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:22:58,751][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:23:00,189][__main__][INFO] - Iteration 186 took 55s (9.15% Gen, 88.27% Train). Generation: 5s, Training: 49s. Estimated remaining time: 12h 33m 11s. Estimated total time: 15h 31m 37s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 9s, 500 more iterations: 7h 45m 48s. [2026-03-25 17:23:00,191][__main__][INFO] - Starting iteration 186. [2026-03-25 17:23:00,196][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:23:00,196][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:23:05,254][__main__][INFO] - Number of regex retries in iteration 186: 0 [2026-03-25 17:23:05,255][__main__][INFO] - agents played in iteration 186 are Bob, Alice [2026-03-25 17:23:05,772][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:23:05,842][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:23:05,843][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:23:05,844][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:23:06,598][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:23:07,249][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:23:07,973][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:23:08,692][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:23:09,410][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:23:10,129][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:23:10,845][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:23:11,561][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:23:12,276][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:23:12,992][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:23:13,709][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:23:14,427][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:23:15,142][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:23:15,860][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:23:16,576][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:23:17,293][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:23:18,009][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:23:18,725][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:23:19,442][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:23:20,158][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:23:20,877][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:23:21,592][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:23:22,310][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:23:23,026][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:23:23,744][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:23:24,460][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:23:25,180][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:23:25,896][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:23:26,615][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:23:27,330][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:23:28,047][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:23:28,765][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:23:29,481][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:23:30,200][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:23:30,916][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:23:31,634][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:23:32,351][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:23:33,068][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:23:33,786][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:23:34,504][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:23:35,223][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:23:35,940][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:23:36,659][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:23:37,376][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:23:38,092][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:23:38,813][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:23:39,532][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:23:40,250][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:23:41,236][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:23:41,954][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:23:42,672][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:23:43,390][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:23:44,107][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:23:44,829][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:23:45,548][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:23:46,264][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:23:46,984][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:23:47,702][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:23:48,421][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:23:49,140][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:23:49,858][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:23:50,576][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:23:51,295][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:23:52,011][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:23:52,731][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:23:53,453][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 17:23:54,625][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:23:54,629][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:23:54,631][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:23:55,946][__main__][INFO] - Iteration 187 took 55s (9.07% Gen, 88.57% Train). Generation: 5s, Training: 49s. Estimated remaining time: 12h 29m 50s. Estimated total time: 15h 29m 12s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 55s, 500 more iterations: 7h 44m 36s. [2026-03-25 17:23:55,948][__main__][INFO] - Starting iteration 187. [2026-03-25 17:23:55,952][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:23:55,953][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:24:01,034][__main__][INFO] - Number of regex retries in iteration 187: 0 [2026-03-25 17:24:01,035][__main__][INFO] - agents played in iteration 187 are Bob, Alice [2026-03-25 17:24:01,527][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:24:01,592][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:24:01,593][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:24:01,594][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:24:02,276][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:24:02,922][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:24:03,640][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:24:04,355][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:24:05,069][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:24:05,785][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:24:06,502][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:24:07,221][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:24:07,940][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:24:08,655][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:24:09,373][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:24:10,087][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:24:10,808][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:24:11,526][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:24:12,245][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:24:12,961][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:24:13,676][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:24:14,393][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:24:15,110][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:24:15,827][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:24:16,544][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:24:17,260][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:24:17,977][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:24:18,694][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:24:19,412][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:24:20,129][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:24:20,845][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:24:21,564][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:24:22,279][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:24:22,995][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:24:23,714][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:24:24,431][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:24:25,148][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:24:25,867][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:24:26,584][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:24:27,302][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:24:28,021][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:24:28,739][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:24:29,456][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:24:30,174][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:24:30,890][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:24:31,608][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:24:32,326][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:24:33,044][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:24:33,762][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:24:34,482][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:24:35,199][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:24:35,916][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:24:36,864][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:24:37,584][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:24:38,303][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:24:39,022][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:24:39,741][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:24:40,458][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:24:41,177][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:24:41,895][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:24:42,612][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:24:43,332][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:24:44,051][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:24:44,768][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:24:45,489][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:24:46,208][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:24:46,928][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:24:47,648][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:24:48,366][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:24:49,092][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 17:24:50,062][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:24:50,065][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:24:50,066][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:24:52,200][__main__][INFO] - Iteration 188 took 56s (9.03% Gen, 87.17% Train). Generation: 5s, Training: 49s. Estimated remaining time: 12h 37m 11s. Estimated total time: 15h 37m 29s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 44s, 500 more iterations: 7h 48m 44s. [2026-03-25 17:24:52,204][__main__][INFO] - Starting iteration 188. [2026-03-25 17:24:52,208][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:24:52,209][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:25:03,137][__main__][INFO] - Number of regex retries in iteration 188: 0 [2026-03-25 17:25:03,138][__main__][INFO] - agents played in iteration 188 are Bob, Alice [2026-03-25 17:25:03,640][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:25:03,706][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:25:03,707][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:25:03,708][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:25:04,390][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:25:05,034][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:25:05,748][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:25:06,461][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:25:07,174][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:25:07,886][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:25:08,601][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:25:09,313][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:25:10,029][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:25:10,743][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:25:11,457][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:25:12,170][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:25:12,884][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:25:13,597][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:25:14,312][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:25:15,027][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:25:15,740][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:25:16,454][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:25:17,167][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:25:17,880][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:25:18,593][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:25:19,308][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:25:20,021][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:25:20,736][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:25:21,448][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:25:22,163][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:25:22,876][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:25:23,589][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:25:24,304][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:25:25,019][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:25:25,733][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:25:26,448][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:25:27,162][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:25:27,877][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:25:28,593][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:25:29,306][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:25:30,023][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:25:30,737][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:25:31,450][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:25:32,167][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:25:32,881][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:25:33,597][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:25:34,313][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:25:35,027][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:25:35,743][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:25:36,457][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:25:37,172][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:25:37,888][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:25:38,855][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:25:39,571][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:25:40,289][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:25:41,004][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:25:41,719][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:25:42,435][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:25:43,150][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:25:43,867][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:25:44,582][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:25:45,299][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:25:46,015][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:25:46,732][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:25:47,447][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:25:48,165][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:25:48,881][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:25:49,596][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:25:50,311][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:25:51,103][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 17:25:52,248][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:25:52,252][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:25:52,254][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:25:53,619][__main__][INFO] - Iteration 189 took 1m 1s (17.80% Gen, 79.98% Train). Generation: 10s, Training: 49s. Estimated remaining time: 14h 2m 13s. Estimated total time: 17h 3m 32s. Time estimates for 10 more iterations: 10m 14s, 100 more iterations: 1h 42m 21s, 500 more iterations: 8h 31m 46s. [2026-03-25 17:25:53,623][__main__][INFO] - Starting iteration 189. [2026-03-25 17:25:53,627][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:25:53,628][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:25:58,783][__main__][INFO] - Number of regex retries in iteration 189: 0 [2026-03-25 17:25:58,785][__main__][INFO] - agents played in iteration 189 are Bob, Alice [2026-03-25 17:25:59,309][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:25:59,374][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:25:59,375][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:25:59,376][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:26:00,054][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:26:00,699][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:26:01,416][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:26:02,130][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:26:02,843][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:26:03,557][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:26:04,272][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:26:04,985][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:26:05,699][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:26:06,414][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:26:07,128][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:26:07,841][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:26:08,559][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:26:09,273][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:26:09,990][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:26:10,705][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:26:11,421][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:26:12,134][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:26:12,850][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:26:13,564][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:26:14,278][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:26:14,995][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:26:15,709][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:26:16,424][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:26:17,139][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:26:17,856][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:26:18,571][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:26:19,288][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:26:20,003][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:26:20,718][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:26:21,433][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:26:22,148][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:26:22,864][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:26:23,581][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:26:24,297][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:26:25,013][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:26:25,729][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:26:26,445][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:26:27,162][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:26:27,879][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:26:28,596][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:26:29,312][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:26:30,029][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:26:30,744][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:26:31,462][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:26:32,180][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:26:32,898][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:26:33,614][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:26:34,585][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:26:35,305][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:26:36,019][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:26:36,737][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:26:37,455][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:26:38,173][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:26:38,890][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:26:39,607][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:26:40,323][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:26:41,041][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:26:41,758][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:26:42,474][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:26:43,190][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:26:43,907][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:26:44,623][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:26:45,338][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:26:46,057][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:26:46,783][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 17:26:48,009][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:26:48,013][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:26:48,015][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:26:49,509][__main__][INFO] - Iteration 190 took 55s (9.23% Gen, 88.10% Train). Generation: 5s, Training: 49s. Estimated remaining time: 12h 29m 8s. Estimated total time: 15h 31m 23s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 8s, 500 more iterations: 7h 45m 41s. [2026-03-25 17:26:49,511][__main__][INFO] - Starting iteration 190. [2026-03-25 17:26:49,516][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:26:49,516][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:26:54,523][__main__][INFO] - Number of regex retries in iteration 190: 0 [2026-03-25 17:26:54,524][__main__][INFO] - agents played in iteration 190 are Bob, Alice [2026-03-25 17:26:55,103][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:26:55,169][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:26:55,170][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:26:55,171][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:26:55,851][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:26:56,497][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:26:57,215][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:26:57,927][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:26:58,642][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:26:59,357][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:27:00,072][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:27:00,787][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:27:01,502][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:27:02,217][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:27:02,931][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:27:03,646][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:27:04,361][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:27:05,075][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:27:05,790][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:27:06,505][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:27:07,220][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:27:07,935][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:27:08,654][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:27:09,370][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:27:10,086][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:27:10,802][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:27:11,519][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:27:12,234][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:27:12,950][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:27:13,664][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:27:14,380][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:27:15,094][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:27:15,809][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:27:16,525][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:27:17,240][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:27:17,957][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:27:18,673][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:27:19,389][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:27:20,106][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:27:20,822][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:27:21,536][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:27:22,255][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:27:22,971][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:27:23,688][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:27:24,404][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:27:25,120][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:27:25,836][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:27:26,552][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:27:27,268][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:27:27,985][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:27:28,701][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:27:29,420][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:27:30,367][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:27:31,084][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:27:31,799][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:27:32,516][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:27:33,235][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:27:33,952][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:27:34,670][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:27:35,387][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:27:36,103][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:27:36,822][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:27:37,538][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:27:38,256][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:27:38,972][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:27:39,690][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:27:40,406][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:27:41,124][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:27:41,840][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:27:42,559][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 17:27:43,813][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:27:43,818][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:27:43,820][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:27:45,290][__main__][INFO] - Iteration 191 took 55s (8.98% Gen, 88.38% Train). Generation: 5s, Training: 49s. Estimated remaining time: 12h 26m 25s. Estimated total time: 15h 29m 36s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 57s, 500 more iterations: 7h 44m 48s. [2026-03-25 17:27:45,294][__main__][INFO] - Starting iteration 191. [2026-03-25 17:27:45,300][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:27:45,302][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:27:50,401][__main__][INFO] - Number of regex retries in iteration 191: 0 [2026-03-25 17:27:50,402][__main__][INFO] - agents played in iteration 191 are Bob, Alice [2026-03-25 17:27:50,904][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:27:50,970][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:27:50,972][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:27:50,972][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:27:51,653][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:27:52,298][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:27:53,017][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:27:53,730][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:27:54,447][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:27:55,160][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:27:55,874][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:27:56,589][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:27:57,302][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:27:58,019][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:27:58,733][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:27:59,448][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:28:00,163][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:28:00,878][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:28:01,593][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:28:02,308][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:28:03,025][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:28:03,739][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:28:04,455][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:28:05,170][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:28:05,886][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:28:06,602][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:28:07,319][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:28:08,033][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:28:08,750][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:28:09,467][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:28:10,184][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:28:10,900][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:28:11,617][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:28:12,332][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:28:13,049][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:28:13,766][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:28:14,483][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:28:15,199][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:28:15,918][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:28:16,635][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:28:17,351][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:28:18,069][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:28:18,786][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:28:19,503][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:28:20,218][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:28:20,935][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:28:21,653][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:28:22,370][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:28:23,087][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:28:23,804][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:28:24,523][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:28:25,240][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:28:26,198][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:28:26,915][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:28:27,632][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:28:28,351][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:28:29,069][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:28:29,786][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:28:30,502][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:28:31,220][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:28:31,938][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:28:32,656][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:28:33,373][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:28:34,092][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:28:34,809][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:28:35,526][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:28:36,245][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:28:36,963][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:28:37,680][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:28:38,458][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 17:28:39,444][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:28:39,446][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:28:39,448][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:28:40,889][__main__][INFO] - Iteration 192 took 55s (9.17% Gen, 88.23% Train). Generation: 5s, Training: 49s. Estimated remaining time: 12h 22m 25s. Estimated total time: 15h 26m 31s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 39s, 500 more iterations: 7h 43m 15s. [2026-03-25 17:28:40,892][__main__][INFO] - Starting iteration 192. [2026-03-25 17:28:40,897][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:28:40,897][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:28:45,924][__main__][INFO] - Number of regex retries in iteration 192: 0 [2026-03-25 17:28:45,925][__main__][INFO] - agents played in iteration 192 are Bob, Alice [2026-03-25 17:28:46,412][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:28:46,476][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:28:46,477][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:28:46,478][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:28:47,160][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:28:47,804][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:28:48,522][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:28:49,235][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:28:49,950][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:28:50,665][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:28:51,379][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:28:52,095][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:28:52,809][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:28:53,523][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:28:54,241][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:28:54,955][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:28:55,670][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:28:56,386][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:28:57,101][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:28:57,817][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:28:58,536][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:28:59,252][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:28:59,967][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:29:00,682][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:29:01,398][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:29:02,113][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:29:02,831][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:29:03,545][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:29:04,263][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:29:04,980][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:29:05,698][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:29:06,414][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:29:07,132][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:29:07,850][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:29:08,568][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:29:09,287][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:29:10,005][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:29:10,721][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:29:11,437][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:29:12,153][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:29:12,871][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:29:13,588][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:29:14,306][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:29:15,024][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:29:15,742][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:29:16,458][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:29:17,175][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:29:17,890][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:29:18,608][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:29:19,324][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:29:20,040][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:29:20,757][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:29:21,733][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:29:22,451][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:29:23,167][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:29:23,883][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:29:24,598][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:29:25,318][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:29:26,034][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:29:26,753][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:29:27,470][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:29:28,187][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:29:28,905][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:29:29,625][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:29:30,346][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:29:31,064][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:29:31,785][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:29:32,504][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:29:33,220][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:29:33,941][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 17:29:34,968][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:29:34,971][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:29:34,973][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:29:36,793][__main__][INFO] - Iteration 193 took 55s (8.99% Gen, 87.75% Train). Generation: 5s, Training: 49s. Estimated remaining time: 12h 26m 36s. Estimated total time: 15h 31m 38s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 9s, 500 more iterations: 7h 45m 49s. [2026-03-25 17:29:36,797][__main__][INFO] - Starting iteration 193. [2026-03-25 17:29:36,802][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:29:36,803][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:29:41,844][__main__][INFO] - Number of regex retries in iteration 193: 0 [2026-03-25 17:29:41,846][__main__][INFO] - agents played in iteration 193 are Bob, Alice [2026-03-25 17:29:42,330][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:29:42,396][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:29:42,397][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:29:42,398][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:29:43,079][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:29:43,733][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:29:44,452][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:29:45,166][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:29:45,880][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:29:46,593][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:29:47,308][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:29:48,024][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:29:48,739][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:29:49,455][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:29:50,167][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:29:50,882][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:29:51,596][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:29:52,311][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:29:53,026][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:29:53,742][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:29:54,457][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:29:55,173][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:29:55,888][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:29:56,604][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:29:57,319][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:29:58,035][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:29:58,751][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:29:59,466][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:30:00,183][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:30:00,898][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:30:01,616][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:30:02,330][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:30:03,047][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:30:03,764][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:30:04,479][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:30:05,196][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:30:05,911][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:30:06,629][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:30:07,343][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:30:08,062][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:30:08,780][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:30:09,498][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:30:10,215][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:30:10,930][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:30:11,646][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:30:12,363][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:30:13,079][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:30:13,796][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:30:14,513][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:30:15,231][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:30:15,949][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:30:16,665][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:30:17,614][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:30:18,333][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:30:19,049][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:30:19,768][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:30:20,487][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:30:21,203][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:30:21,920][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:30:22,638][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:30:23,355][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:30:24,072][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:30:24,791][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:30:25,511][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:30:26,228][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:30:26,946][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:30:27,664][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:30:28,382][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:30:29,098][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:30:29,822][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 17:30:30,800][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:30:30,803][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:30:30,804][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:30:32,150][__main__][INFO] - Iteration 194 took 55s (9.11% Gen, 88.45% Train). Generation: 5s, Training: 48s. Estimated remaining time: 12h 16m 33s. Estimated total time: 15h 22m 30s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 15s, 500 more iterations: 7h 41m 15s. [2026-03-25 17:30:32,152][__main__][INFO] - Starting iteration 194. [2026-03-25 17:30:32,156][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:30:32,157][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:30:38,587][__main__][INFO] - Number of regex retries in iteration 194: 0 [2026-03-25 17:30:38,588][__main__][INFO] - agents played in iteration 194 are Bob, Alice [2026-03-25 17:30:39,075][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:30:39,141][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:30:39,142][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:30:39,143][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:30:39,838][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:30:40,483][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:30:41,199][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:30:41,914][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:30:42,627][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:30:43,341][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:30:44,056][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:30:44,772][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:30:45,488][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:30:46,202][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:30:46,919][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:30:47,632][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:30:48,349][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:30:49,065][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:30:49,782][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:30:50,496][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:30:51,213][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:30:51,927][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:30:52,643][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:30:53,358][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:30:54,074][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:30:54,790][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:30:55,507][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:30:56,225][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:30:56,942][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:30:57,661][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:30:58,377][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:30:59,095][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:30:59,811][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:31:00,528][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:31:01,245][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:31:01,964][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:31:02,681][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:31:03,399][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:31:04,118][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:31:04,834][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:31:05,555][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:31:06,272][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:31:06,992][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:31:07,710][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:31:08,430][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:31:09,147][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:31:09,867][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:31:10,584][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:31:11,301][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:31:12,019][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:31:12,738][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:31:13,458][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:31:14,488][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:31:15,206][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:31:15,925][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:31:16,643][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:31:17,362][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:31:18,081][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:31:18,797][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:31:19,518][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:31:20,238][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:31:20,957][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:31:21,675][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:31:22,395][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:31:23,115][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:31:23,832][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:31:24,550][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:31:25,268][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:31:25,987][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:31:26,725][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 17:31:27,684][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:31:27,687][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:31:27,688][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:31:29,066][__main__][INFO] - Iteration 195 took 56s (11.30% Gen, 86.27% Train). Generation: 6s, Training: 49s. Estimated remaining time: 12h 41m 37s. Estimated total time: 15h 48m 31s. Time estimates for 10 more iterations: 9m 29s, 100 more iterations: 1h 34m 51s, 500 more iterations: 7h 54m 15s. [2026-03-25 17:31:29,068][__main__][INFO] - Starting iteration 195. [2026-03-25 17:31:29,072][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:31:29,073][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:31:34,108][__main__][INFO] - Number of regex retries in iteration 195: 0 [2026-03-25 17:31:34,109][__main__][INFO] - agents played in iteration 195 are Bob, Alice [2026-03-25 17:31:34,604][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:31:34,674][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:31:34,675][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:31:34,675][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:31:35,387][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:31:36,033][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:31:36,748][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:31:37,463][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:31:38,176][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:31:38,891][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:31:39,607][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:31:40,322][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:31:41,038][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:31:41,753][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:31:42,468][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:31:43,183][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:31:43,899][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:31:44,614][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:31:45,330][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:31:46,045][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:31:46,761][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:31:47,477][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:31:48,192][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:31:48,909][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:31:49,624][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:31:50,342][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:31:51,058][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:31:51,774][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:31:52,487][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:31:53,205][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:31:53,920][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:31:54,636][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:31:55,352][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:31:56,069][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:31:56,785][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:31:57,503][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:31:58,221][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:31:58,936][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:31:59,653][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:32:00,370][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:32:01,087][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:32:01,805][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:32:02,519][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:32:03,238][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:32:03,954][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:32:04,671][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:32:05,387][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:32:06,104][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:32:06,821][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:32:07,538][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:32:08,256][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:32:08,973][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:32:09,925][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:32:10,641][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:32:11,357][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:32:12,076][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:32:12,794][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:32:13,510][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:32:14,227][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:32:14,943][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:32:15,664][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:32:16,380][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:32:17,099][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:32:17,815][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:32:18,532][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:32:19,249][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:32:19,968][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:32:20,685][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:32:21,404][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:32:22,146][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 17:32:23,059][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:32:23,063][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:32:23,065][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:32:24,494][__main__][INFO] - Iteration 196 took 55s (9.09% Gen, 88.33% Train). Generation: 5s, Training: 48s. Estimated remaining time: 12h 15m 53s. Estimated total time: 15h 23m 43s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 22s, 500 more iterations: 7h 41m 51s. [2026-03-25 17:32:24,497][__main__][INFO] - Starting iteration 196. [2026-03-25 17:32:24,502][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:32:24,502][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:32:29,588][__main__][INFO] - Number of regex retries in iteration 196: 0 [2026-03-25 17:32:29,589][__main__][INFO] - agents played in iteration 196 are Bob, Alice [2026-03-25 17:32:30,102][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:32:30,169][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:32:30,170][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:32:30,170][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:32:30,890][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:32:31,537][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:32:32,256][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:32:32,972][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:32:33,689][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:32:34,406][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:32:35,121][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:32:35,838][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:32:36,554][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:32:37,272][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:32:37,988][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:32:38,705][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:32:39,423][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:32:40,140][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:32:40,858][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:32:41,573][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:32:42,291][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:32:43,007][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:32:43,726][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:32:44,443][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:32:45,161][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:32:45,878][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:32:46,598][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:32:47,313][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:32:48,032][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:32:48,751][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:32:49,469][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:32:50,187][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:32:50,904][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:32:51,621][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:32:52,341][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:32:53,057][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:32:53,776][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:32:54,494][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:32:55,212][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:32:55,931][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:32:56,649][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:32:57,367][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:32:58,088][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:32:58,805][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:32:59,524][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:33:00,243][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:33:00,963][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:33:01,682][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:33:02,402][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:33:03,121][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:33:03,838][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:33:04,557][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:33:05,513][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:33:06,230][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:33:06,947][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:33:07,664][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:33:08,381][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:33:09,100][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:33:09,818][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:33:10,535][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:33:11,254][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:33:11,973][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:33:12,688][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:33:13,406][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:33:14,122][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:33:14,840][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:33:15,558][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:33:16,277][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:33:16,992][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:33:17,721][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 17:33:18,837][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:33:18,840][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:33:18,841][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:33:20,341][__main__][INFO] - Iteration 197 took 55s (9.11% Gen, 88.20% Train). Generation: 5s, Training: 49s. Estimated remaining time: 12h 21m 56s. Estimated total time: 15h 30m 42s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 4s, 500 more iterations: 7h 45m 21s. [2026-03-25 17:33:20,344][__main__][INFO] - Starting iteration 197. [2026-03-25 17:33:20,348][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:33:20,349][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:33:25,596][__main__][INFO] - Number of regex retries in iteration 197: 0 [2026-03-25 17:33:25,597][__main__][INFO] - agents played in iteration 197 are Bob, Alice [2026-03-25 17:33:26,174][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:33:26,240][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:33:26,241][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:33:26,242][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:33:26,933][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:33:27,580][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:33:28,297][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:33:29,012][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:33:29,730][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:33:30,448][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:33:31,166][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:33:31,884][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:33:32,603][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:33:33,320][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:33:34,039][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:33:34,757][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:33:35,476][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:33:36,195][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:33:36,915][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:33:37,635][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:33:38,354][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:33:39,075][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:33:39,794][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:33:40,510][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:33:41,228][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:33:41,945][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:33:42,663][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:33:43,380][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:33:44,098][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:33:44,816][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:33:45,534][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:33:46,252][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:33:46,971][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:33:47,688][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:33:48,406][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:33:49,124][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:33:49,841][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:33:50,561][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:33:51,279][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:33:51,996][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:33:52,716][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:33:53,435][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:33:54,154][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:33:54,871][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:33:55,589][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:33:56,307][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:33:57,026][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:33:57,744][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:33:58,461][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:33:59,182][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:33:59,900][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:34:00,617][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:34:01,635][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:34:02,353][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:34:03,071][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:34:03,789][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:34:04,507][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:34:05,227][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:34:05,946][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:34:06,663][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:34:07,383][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:34:08,103][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:34:08,821][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:34:09,541][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:34:10,260][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:34:10,983][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:34:11,704][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:34:12,425][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:34:13,145][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:34:13,912][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 17:34:14,928][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:34:14,931][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:34:14,932][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:34:16,259][__main__][INFO] - Iteration 198 took 55s (9.39% Gen, 88.24% Train). Generation: 5s, Training: 49s. Estimated remaining time: 12h 22m 10s. Estimated total time: 15h 31m 52s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 11s, 500 more iterations: 7h 45m 56s. [2026-03-25 17:34:16,262][__main__][INFO] - Starting iteration 198. [2026-03-25 17:34:16,266][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:34:16,266][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:34:21,371][__main__][INFO] - Number of regex retries in iteration 198: 0 [2026-03-25 17:34:21,372][__main__][INFO] - agents played in iteration 198 are Bob, Alice [2026-03-25 17:34:21,936][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:34:22,005][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:34:22,006][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:34:22,007][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:34:22,799][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:34:23,446][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:34:24,163][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:34:24,880][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:34:25,600][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:34:26,318][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:34:27,036][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:34:27,757][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:34:28,473][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:34:29,194][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:34:29,915][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:34:30,632][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:34:31,351][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:34:32,070][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:34:32,790][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:34:33,510][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:34:34,227][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:34:34,944][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:34:35,661][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:34:36,378][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:34:37,094][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:34:37,810][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:34:38,527][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:34:39,245][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:34:39,962][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:34:40,678][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:34:41,395][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:34:42,112][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:34:42,828][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:34:43,545][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:34:44,261][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:34:44,979][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:34:45,695][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:34:46,412][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:34:47,128][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:34:47,849][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:34:48,567][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:34:49,287][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:34:50,006][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:34:50,725][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:34:51,443][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:34:52,161][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:34:52,878][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:34:53,597][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:34:54,316][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:34:55,035][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:34:55,756][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:34:56,475][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:34:57,425][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:34:58,146][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:34:58,862][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:34:59,581][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:35:00,302][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:35:01,021][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:35:01,740][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:35:02,458][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:35:03,178][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:35:03,899][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:35:04,616][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:35:05,335][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:35:06,056][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:35:06,775][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:35:07,494][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:35:08,214][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:35:08,933][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:35:09,685][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 17:35:10,686][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:35:10,688][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:35:10,691][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:35:12,051][__main__][INFO] - Iteration 199 took 55s (9.15% Gen, 88.41% Train). Generation: 5s, Training: 49s. Estimated remaining time: 12h 19m 9s. Estimated total time: 15h 29m 47s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 58s, 500 more iterations: 7h 44m 53s. [2026-03-25 17:35:12,054][__main__][INFO] - Starting iteration 199. [2026-03-25 17:35:12,057][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:35:12,058][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:35:17,457][__main__][INFO] - Number of regex retries in iteration 199: 0 [2026-03-25 17:35:17,458][__main__][INFO] - agents played in iteration 199 are Bob, Alice [2026-03-25 17:35:17,973][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:35:18,040][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:35:18,041][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:35:18,042][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:35:18,772][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:35:19,419][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:35:20,136][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:35:20,851][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:35:21,568][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:35:22,284][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:35:22,998][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:35:23,717][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:35:24,433][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:35:25,150][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:35:25,867][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:35:26,584][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:35:27,301][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:35:28,017][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:35:28,733][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:35:29,451][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:35:30,169][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:35:30,885][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:35:31,603][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:35:32,321][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:35:33,039][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:35:33,756][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:35:34,474][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:35:35,192][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:35:35,911][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:35:36,628][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:35:37,347][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:35:38,066][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:35:38,783][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:35:39,501][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:35:40,224][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:35:40,942][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:35:41,659][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:35:42,377][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:35:43,096][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:35:43,816][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:35:44,533][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:35:45,253][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:35:45,970][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:35:46,686][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:35:47,405][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:35:48,124][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:35:48,842][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:35:49,561][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:35:50,278][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:35:50,998][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:35:51,716][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:35:52,434][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:35:53,404][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:35:54,124][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:35:54,843][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:35:55,562][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:35:56,282][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:35:57,000][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:35:57,719][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:35:58,437][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:35:59,156][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:35:59,875][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:36:00,594][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:36:01,313][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:36:02,032][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:36:02,751][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:36:03,468][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:36:04,187][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:36:04,905][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:36:05,662][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 17:36:06,674][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:36:06,677][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:36:06,679][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:36:08,078][__main__][INFO] - Iteration 200 took 56s (9.64% Gen, 87.86% Train). Generation: 5s, Training: 49s. Estimated remaining time: 12h 22m 8s. Estimated total time: 15h 33m 42s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 22s, 500 more iterations: 7h 46m 51s. [2026-03-25 17:36:08,081][__main__][INFO] - Starting iteration 200. [2026-03-25 17:36:08,096][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2026-03-25 17:36:08,096][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:36:13,220][__main__][INFO] - Number of regex retries in iteration 200: 0 [2026-03-25 17:36:13,221][__main__][INFO] - agents played in iteration 200 are Bob, Alice [2026-03-25 17:36:13,727][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:36:13,795][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:36:13,796][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:36:13,797][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:36:14,506][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:36:15,152][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:36:15,870][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:36:16,584][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:36:17,300][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:36:18,014][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:36:18,730][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:36:19,444][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:36:20,160][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:36:20,874][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:36:21,590][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:36:22,304][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:36:23,019][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:36:23,736][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:36:24,450][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:36:25,166][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:36:25,881][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:36:26,596][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:36:27,310][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:36:28,028][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:36:28,742][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:36:29,459][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:36:30,174][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:36:30,889][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:36:31,606][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:36:32,321][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:36:33,039][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:36:33,753][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:36:34,469][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:36:35,186][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:36:35,903][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:36:36,621][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:36:37,337][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:36:38,054][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:36:38,772][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:36:39,488][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:36:40,205][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:36:40,922][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:36:41,640][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:36:42,358][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:36:43,075][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:36:43,790][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:36:44,507][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:36:45,222][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:36:45,939][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:36:46,656][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:36:47,372][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:36:48,090][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:36:49,055][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:36:49,772][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:36:50,489][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:36:51,206][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:36:51,924][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:36:52,641][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:36:53,359][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:36:54,075][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:36:54,792][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:36:55,508][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:36:56,226][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:36:56,944][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:36:57,661][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:36:58,377][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:36:59,094][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:36:59,812][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:37:00,531][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:37:01,313][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 17:37:02,411][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:37:02,416][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:37:02,419][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:37:05,307][__main__][INFO] - Iteration 201 took 57s (8.96% Gen, 85.99% Train). Generation: 5s, Training: 49s. Estimated remaining time: 12h 41m 2s. Estimated total time: 15h 53m 33s. Time estimates for 10 more iterations: 9m 32s, 100 more iterations: 1h 35m 21s, 500 more iterations: 7h 56m 46s. [2026-03-25 17:37:05,311][__main__][INFO] - Starting iteration 201. [2026-03-25 17:37:05,316][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:37:05,316][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:37:10,447][__main__][INFO] - Number of regex retries in iteration 201: 0 [2026-03-25 17:37:10,448][__main__][INFO] - agents played in iteration 201 are Bob, Alice [2026-03-25 17:37:10,933][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:37:11,000][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:37:11,000][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:37:11,001][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:37:11,683][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:37:12,328][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:37:13,045][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:37:13,757][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:37:14,470][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:37:15,185][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:37:15,898][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:37:16,614][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:37:17,327][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:37:18,045][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:37:18,759][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:37:19,475][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:37:20,189][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:37:20,904][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:37:21,618][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:37:22,332][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:37:23,048][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:37:23,764][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:37:24,480][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:37:25,197][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:37:25,912][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:37:26,626][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:37:27,343][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:37:28,058][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:37:28,775][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:37:29,489][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:37:30,206][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:37:30,921][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:37:31,637][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:37:32,352][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:37:33,068][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:37:33,781][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:37:43,431][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:37:44,142][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:37:44,854][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:37:45,568][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:37:46,280][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:37:46,993][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:37:47,707][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:37:48,419][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:37:49,132][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:37:49,846][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:37:50,559][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:37:51,272][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:37:51,986][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:37:52,699][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:37:53,414][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:37:54,127][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:37:55,075][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:37:55,788][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:37:56,501][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:37:57,216][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:37:57,929][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:37:58,642][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:37:59,356][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:38:00,069][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:38:00,783][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:38:01,498][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:38:02,212][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:38:02,926][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:38:03,639][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:38:04,353][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:38:05,068][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:38:05,781][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:38:06,497][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:38:07,224][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:55 [2026-03-25 17:38:08,197][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:38:08,199][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:38:08,201][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:38:09,517][__main__][INFO] - Iteration 202 took 1m 4s (7.99% Gen, 89.95% Train). Generation: 5s, Training: 57s. Estimated remaining time: 14h 36m 28s. Estimated total time: 17h 50m 3s. Time estimates for 10 more iterations: 10m 42s, 100 more iterations: 1h 47m 0s, 500 more iterations: 8h 55m 1s. [2026-03-25 17:38:09,520][__main__][INFO] - Starting iteration 202. [2026-03-25 17:38:09,524][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:38:09,525][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:38:24,991][__main__][INFO] - Number of regex retries in iteration 202: 0 [2026-03-25 17:38:24,992][__main__][INFO] - agents played in iteration 202 are Bob, Alice [2026-03-25 17:38:25,472][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:38:25,537][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:38:25,538][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:38:25,539][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:38:26,220][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:38:26,860][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:38:27,571][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:38:28,279][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:38:28,988][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:38:29,697][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:38:30,407][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:38:31,118][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:38:31,827][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:38:32,536][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:38:33,247][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:38:33,958][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:38:34,669][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:38:35,379][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:38:36,090][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:38:36,801][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:38:37,514][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:38:38,224][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:38:38,936][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:38:39,649][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:38:40,361][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:38:41,071][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:38:41,784][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:38:42,496][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:38:43,209][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:38:43,920][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:38:44,634][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:38:45,346][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:38:46,058][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:38:46,770][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:38:47,483][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:38:48,197][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:38:48,908][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:38:49,622][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:38:50,335][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:38:51,049][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:38:51,763][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:38:52,477][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:38:53,189][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:38:53,903][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:38:54,618][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:38:55,330][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:38:56,047][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:38:56,761][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:38:57,473][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:38:58,191][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:38:58,905][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:38:59,618][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:39:00,559][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:39:01,274][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:39:01,985][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:39:02,701][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:39:03,414][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:39:04,129][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:39:04,843][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:39:05,556][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:39:06,270][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:39:06,985][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:39:07,699][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:39:08,414][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:39:09,129][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:39:09,843][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:39:10,558][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:39:11,269][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:39:11,984][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:39:12,708][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 17:39:14,002][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:39:14,006][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:39:14,008][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:39:15,298][__main__][INFO] - Iteration 203 took 1m 5s (23.51% Gen, 74.52% Train). Generation: 15s, Training: 49s. Estimated remaining time: 15h 1m 35s. Estimated total time: 18h 16m 16s. Time estimates for 10 more iterations: 10m 57s, 100 more iterations: 1h 49m 37s, 500 more iterations: 9h 8m 8s. [2026-03-25 17:39:15,302][__main__][INFO] - Starting iteration 203. [2026-03-25 17:39:15,305][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:39:15,306][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:39:20,782][__main__][INFO] - Number of regex retries in iteration 203: 0 [2026-03-25 17:39:20,783][__main__][INFO] - agents played in iteration 203 are Bob, Alice [2026-03-25 17:39:21,273][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:39:21,338][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:39:21,339][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:39:21,340][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:39:22,022][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:39:22,665][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:39:23,382][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:39:24,094][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:39:24,807][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:39:25,522][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:39:26,237][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:39:26,949][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:39:27,664][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:39:28,377][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:39:29,092][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:39:29,804][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:39:30,517][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:39:31,233][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:39:31,949][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:39:32,663][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:39:33,376][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:39:34,091][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:39:34,805][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:39:35,519][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:39:36,234][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:39:36,948][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:39:37,660][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:39:38,374][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:39:39,089][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:39:39,805][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:39:40,519][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:39:41,231][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:39:41,948][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:39:42,661][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:39:43,374][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:39:44,089][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:39:44,802][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:39:45,519][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:39:46,234][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:39:46,949][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:39:47,663][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:39:48,379][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:39:49,095][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:39:49,809][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:39:50,525][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:39:51,238][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:39:51,956][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:39:52,671][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:39:53,387][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:39:54,104][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:39:54,823][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:39:55,537][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:39:56,478][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:39:57,195][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:39:57,910][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:39:58,626][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:39:59,340][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:40:00,058][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:40:00,772][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:40:01,488][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:40:02,203][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:40:02,920][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:40:03,635][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:40:04,350][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:40:05,067][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:40:05,782][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:40:06,500][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:40:07,216][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:40:07,934][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:40:08,677][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 17:40:09,646][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:40:09,649][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:40:09,651][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:40:11,022][__main__][INFO] - Iteration 204 took 55s (9.83% Gen, 87.70% Train). Generation: 5s, Training: 48s. Estimated remaining time: 12h 13m 1s. Estimated total time: 15h 28m 38s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 51s, 500 more iterations: 7h 44m 19s. [2026-03-25 17:40:11,026][__main__][INFO] - Starting iteration 204. [2026-03-25 17:40:11,033][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:40:11,034][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:40:16,347][__main__][INFO] - Number of regex retries in iteration 204: 0 [2026-03-25 17:40:16,348][__main__][INFO] - agents played in iteration 204 are Bob, Alice [2026-03-25 17:40:16,903][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:40:16,970][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:40:16,970][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:40:16,971][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:40:17,659][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:40:18,312][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:40:19,028][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:40:19,743][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:40:20,456][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:40:21,172][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:40:21,888][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:40:22,602][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:40:23,318][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:40:24,033][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:40:24,752][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:40:25,470][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:40:26,186][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:40:32,064][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:40:32,780][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:40:33,495][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:40:34,212][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:40:34,929][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:40:35,644][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:40:36,358][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:40:37,076][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:40:37,790][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:40:38,507][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:40:39,224][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:40:39,942][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:40:40,656][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:40:41,375][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:40:42,088][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:40:42,804][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:40:43,520][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:40:44,236][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:40:44,950][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:40:45,667][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:40:46,383][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:40:47,098][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:40:47,815][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:40:48,533][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:40:49,248][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:40:49,964][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:40:50,679][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:40:51,396][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:40:52,111][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:40:52,827][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:40:53,542][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:40:54,257][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:40:54,974][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:40:55,690][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:40:56,406][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:40:57,454][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:40:58,168][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:40:58,886][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:40:59,604][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:41:00,322][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:41:01,039][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:41:01,757][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:41:02,477][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:41:03,195][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:41:03,915][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:41:04,631][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:41:05,350][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:41:06,066][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:41:06,785][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:41:07,503][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:41:08,220][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:41:08,940][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:41:09,709][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:52 [2026-03-25 17:41:10,691][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:41:10,693][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:41:10,695][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:41:12,036][__main__][INFO] - Iteration 205 took 1m 1s (8.71% Gen, 89.08% Train). Generation: 5s, Training: 54s. Estimated remaining time: 13h 40m 8s. Estimated total time: 16h 56m 45s. Time estimates for 10 more iterations: 10m 10s, 100 more iterations: 1h 41m 40s, 500 more iterations: 8h 28m 22s. [2026-03-25 17:41:12,039][__main__][INFO] - Starting iteration 205. [2026-03-25 17:41:12,050][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:41:12,051][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:41:17,157][__main__][INFO] - Number of regex retries in iteration 205: 0 [2026-03-25 17:41:17,158][__main__][INFO] - agents played in iteration 205 are Bob, Alice [2026-03-25 17:41:17,746][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:41:17,811][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:41:17,812][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:41:17,813][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:41:18,537][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:41:19,183][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:41:19,900][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:41:20,615][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:41:21,330][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:41:22,044][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:41:22,760][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:41:23,476][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:41:24,190][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:41:24,907][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:41:25,624][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:41:26,339][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:41:27,054][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:41:27,770][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:41:28,486][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:41:29,204][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:41:29,919][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:41:30,636][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:41:31,350][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:41:32,065][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:41:32,780][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:41:33,494][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:41:34,211][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:41:34,925][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:41:35,640][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:41:36,357][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:41:37,073][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:41:37,788][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:41:38,506][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:41:39,222][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:41:39,939][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:41:40,655][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:41:41,371][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:41:42,086][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:41:42,802][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:41:43,518][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:41:44,234][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:41:44,948][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:41:45,664][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:41:46,379][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:41:47,096][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:41:47,811][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:41:48,530][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:41:49,245][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:41:49,961][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:41:50,677][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:41:51,394][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:41:52,110][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:41:53,059][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:41:53,775][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:41:54,491][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:41:55,207][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:41:55,926][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:41:56,642][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:41:57,359][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:41:58,075][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:41:58,793][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:41:59,509][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:42:00,226][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:42:00,943][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:42:01,662][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:42:02,380][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:42:03,095][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:42:03,813][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:42:04,530][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:42:05,266][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 17:42:06,336][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:42:06,340][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:42:06,341][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:42:07,717][__main__][INFO] - Iteration 206 took 55s (9.18% Gen, 88.35% Train). Generation: 5s, Training: 49s. Estimated remaining time: 12h 10m 15s. Estimated total time: 15h 27m 48s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 46s, 500 more iterations: 7h 43m 54s. [2026-03-25 17:42:07,721][__main__][INFO] - Starting iteration 206. [2026-03-25 17:42:07,728][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:42:07,729][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:42:13,328][__main__][INFO] - Number of regex retries in iteration 206: 0 [2026-03-25 17:42:13,329][__main__][INFO] - agents played in iteration 206 are Bob, Alice [2026-03-25 17:42:13,846][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:42:13,914][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:42:13,915][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:42:13,916][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:42:14,637][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:42:15,284][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:42:16,000][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:42:16,717][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:42:17,431][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:42:18,148][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:42:18,863][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:42:19,579][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:42:20,297][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:42:21,013][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:42:21,731][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:42:22,445][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:42:23,160][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:42:23,878][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:42:24,592][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:42:25,308][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:42:26,023][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:42:26,737][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:42:27,454][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:42:28,171][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:42:28,888][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:42:29,604][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:42:30,320][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:42:31,036][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:42:31,753][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:42:32,467][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:42:33,184][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:42:33,900][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:42:34,615][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:42:35,330][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:42:36,046][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:42:36,760][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:42:37,476][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:42:38,191][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:42:38,909][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:42:39,623][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:42:40,341][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:42:41,057][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:42:41,773][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:42:42,488][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:42:43,204][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:42:43,920][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:42:44,637][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:42:45,352][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:42:46,068][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:42:46,784][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:42:47,500][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:42:48,218][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:42:49,179][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:42:49,897][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:42:50,613][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:42:51,329][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:42:52,047][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:42:52,763][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:42:53,482][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:42:54,198][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:42:54,915][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:42:55,631][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:42:56,351][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:42:57,069][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:42:57,786][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:42:58,503][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:42:59,220][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:42:59,937][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:43:00,654][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:43:01,396][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 17:43:02,449][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:43:02,453][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:43:02,454][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:43:04,331][__main__][INFO] - Iteration 207 took 56s (9.89% Gen, 86.78% Train). Generation: 5s, Training: 49s. Estimated remaining time: 12h 24m 56s. Estimated total time: 15h 43m 26s. Time estimates for 10 more iterations: 9m 26s, 100 more iterations: 1h 34m 20s, 500 more iterations: 7h 51m 43s. [2026-03-25 17:43:04,334][__main__][INFO] - Starting iteration 207. [2026-03-25 17:43:04,338][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:43:04,338][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:43:09,501][__main__][INFO] - Number of regex retries in iteration 207: 0 [2026-03-25 17:43:09,502][__main__][INFO] - agents played in iteration 207 are Bob, Alice [2026-03-25 17:43:10,003][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:43:10,072][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:43:10,075][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:43:10,076][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:43:10,792][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:43:11,439][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:43:12,156][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:43:12,872][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:43:13,586][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:43:14,300][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:43:15,018][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:43:15,734][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:43:16,453][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:43:17,167][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:43:17,881][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:43:18,598][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:43:19,312][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:43:20,029][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:43:20,745][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:43:21,461][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:43:22,175][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:43:22,891][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:43:23,607][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:43:24,323][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:43:25,040][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:43:25,756][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:43:26,472][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:43:27,187][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:43:27,902][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:43:28,620][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:43:29,336][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:43:30,054][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:43:30,770][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:43:31,485][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:43:32,202][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:43:32,916][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:43:33,635][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:43:34,355][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:43:35,072][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:43:35,792][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:43:36,507][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:43:37,226][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:43:37,945][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:43:38,664][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:43:39,383][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:43:40,101][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:43:40,819][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:43:41,536][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:43:42,254][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:43:42,970][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:43:43,687][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:43:44,403][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:43:45,497][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:43:46,214][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:43:46,931][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:43:47,648][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:43:48,365][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:43:49,082][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:43:49,800][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:43:50,518][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:43:51,236][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:43:51,954][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:43:52,672][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:43:53,387][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:43:54,106][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:43:54,822][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:43:55,540][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:43:56,257][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:43:56,976][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:43:57,727][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 17:43:58,789][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:43:58,792][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:43:58,793][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:44:00,276][__main__][INFO] - Iteration 208 took 55s (9.23% Gen, 88.11% Train). Generation: 5s, Training: 49s. Estimated remaining time: 12h 12m 53s. Estimated total time: 15h 32m 19s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 13s, 500 more iterations: 7h 46m 9s. [2026-03-25 17:44:00,278][__main__][INFO] - Starting iteration 208. [2026-03-25 17:44:00,284][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:44:00,285][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:44:05,771][__main__][INFO] - Number of regex retries in iteration 208: 0 [2026-03-25 17:44:05,772][__main__][INFO] - agents played in iteration 208 are Bob, Alice [2026-03-25 17:44:06,263][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:44:06,330][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:44:06,331][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:44:06,331][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:44:07,021][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:44:07,667][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:44:08,384][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:44:09,102][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:44:09,817][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:44:10,534][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:44:11,250][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:44:11,966][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:44:12,682][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:44:13,399][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:44:14,114][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:44:14,832][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:44:15,547][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:44:16,265][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:44:16,981][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:44:17,698][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:44:18,413][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:44:19,130][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:44:19,846][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:44:20,564][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:44:21,280][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:44:21,997][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:44:22,712][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:44:23,429][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:44:24,145][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:44:24,860][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:44:25,578][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:44:26,295][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:44:27,013][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:44:27,729][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:44:28,446][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:44:29,162][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:44:29,878][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:44:30,594][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:44:31,312][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:44:32,028][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:44:32,746][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:44:33,462][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:44:34,182][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:44:34,898][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:44:35,615][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:44:36,331][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:44:37,048][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:44:37,765][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:44:38,481][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:44:39,199][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:44:39,916][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:44:40,634][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:44:41,584][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:44:42,300][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:44:43,017][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:44:43,734][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:44:44,452][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:44:45,168][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:44:45,886][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:44:46,606][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:44:47,323][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:44:48,041][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:44:48,759][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:44:49,477][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:44:50,196][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:44:50,914][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:44:51,633][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:44:52,354][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:44:53,072][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:44:53,802][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 17:44:54,808][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:44:54,812][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:44:54,814][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:44:56,314][__main__][INFO] - Iteration 209 took 56s (9.79% Gen, 87.52% Train). Generation: 5s, Training: 49s. Estimated remaining time: 12h 13m 30s. Estimated total time: 15h 33m 52s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 23s, 500 more iterations: 7h 46m 56s. [2026-03-25 17:44:56,317][__main__][INFO] - Starting iteration 209. [2026-03-25 17:44:56,322][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:44:56,323][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:45:09,108][__main__][INFO] - Number of regex retries in iteration 209: 0 [2026-03-25 17:45:09,109][__main__][INFO] - agents played in iteration 209 are Bob, Alice [2026-03-25 17:45:09,620][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:45:09,688][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:45:09,690][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:45:09,690][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:45:10,423][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:45:11,067][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:45:11,786][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:45:12,498][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:45:13,211][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:45:13,927][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:45:14,642][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:45:15,357][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:45:16,073][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:45:16,789][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:45:17,505][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:45:18,222][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:45:18,939][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:45:19,658][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:45:20,373][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:45:21,089][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:45:21,805][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:45:22,521][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:45:23,237][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:45:23,955][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:45:24,671][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:45:25,505][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:45:26,258][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:45:26,975][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:45:27,693][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:45:28,444][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:45:29,164][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:45:29,879][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:45:30,596][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:45:31,310][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:45:32,027][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:45:32,742][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:45:33,459][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:45:34,173][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:45:34,888][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:45:35,605][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:45:36,320][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:45:37,037][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:45:37,754][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:45:38,470][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:45:39,191][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:45:39,909][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:45:40,628][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:45:41,346][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:45:42,065][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:45:42,786][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:45:43,505][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:45:44,225][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:45:45,239][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:45:45,957][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:45:46,677][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:45:47,398][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:45:48,118][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:45:48,839][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:45:49,559][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:45:50,280][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:45:51,000][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:45:51,720][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:45:52,439][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:45:53,159][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:45:53,879][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:45:54,597][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:45:55,317][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:45:56,035][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:45:56,753][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:45:57,553][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 17:45:58,653][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:45:58,657][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:45:58,658][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:46:00,004][__main__][INFO] - Iteration 210 took 1m 3s (20.08% Gen, 77.80% Train). Generation: 12s, Training: 49s. Estimated remaining time: 14h 19m 59s. Estimated total time: 17h 41m 24s. Time estimates for 10 more iterations: 10m 36s, 100 more iterations: 1h 46m 8s, 500 more iterations: 8h 50m 42s. [2026-03-25 17:46:00,015][__main__][INFO] - Starting iteration 210. [2026-03-25 17:46:00,031][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:46:00,032][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:46:05,450][__main__][INFO] - Number of regex retries in iteration 210: 0 [2026-03-25 17:46:05,451][__main__][INFO] - agents played in iteration 210 are Bob, Alice [2026-03-25 17:46:05,973][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:46:06,042][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:46:06,043][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:46:06,044][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:46:06,823][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:46:07,470][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:46:08,190][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:46:08,907][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:46:09,623][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:46:10,342][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:46:11,059][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:46:11,779][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:46:12,497][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:46:13,215][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:46:13,935][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:46:14,655][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:46:15,374][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:46:16,094][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:46:16,814][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:46:17,536][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:46:18,256][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:46:18,975][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:46:19,694][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:46:20,414][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:46:21,136][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:46:21,855][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:46:22,573][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:46:23,293][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:46:24,012][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:46:24,733][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:46:25,453][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:46:26,172][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:46:26,890][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:46:27,610][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:46:28,327][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:46:29,042][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:46:29,759][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:46:30,475][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:46:31,192][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:46:31,907][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:46:32,625][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:46:33,341][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:46:34,059][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:46:34,773][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:46:35,490][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:46:36,208][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:46:36,926][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:46:37,643][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:46:38,360][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:46:39,075][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:46:39,794][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:46:40,510][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:46:41,550][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:46:42,266][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:46:42,983][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:46:43,699][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:46:44,416][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:46:45,133][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:46:45,851][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:46:46,568][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:46:47,285][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:46:48,002][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:46:48,721][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:46:49,439][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:46:50,157][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:46:50,873][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:46:51,592][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:46:52,309][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:46:53,028][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:46:53,767][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 17:46:55,035][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:46:55,040][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:46:55,042][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:46:56,610][__main__][INFO] - Iteration 211 took 56s (9.58% Gen, 87.65% Train). Generation: 5s, Training: 49s. Estimated remaining time: 12h 20m 38s. Estimated total time: 15h 43m 1s. Time estimates for 10 more iterations: 9m 25s, 100 more iterations: 1h 34m 18s, 500 more iterations: 7h 51m 30s. [2026-03-25 17:46:56,614][__main__][INFO] - Starting iteration 211. [2026-03-25 17:46:56,621][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:46:56,622][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:47:01,785][__main__][INFO] - Number of regex retries in iteration 211: 0 [2026-03-25 17:47:01,786][__main__][INFO] - agents played in iteration 211 are Bob, Alice [2026-03-25 17:47:02,277][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:47:02,342][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:47:02,343][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:47:02,344][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:47:03,031][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:47:03,677][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:47:04,395][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:47:05,109][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:47:05,825][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:47:06,540][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:47:07,259][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:47:07,975][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:47:08,693][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:47:09,410][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:47:10,126][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:47:10,844][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:47:11,561][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:47:12,279][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:47:12,994][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:47:13,713][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:47:14,427][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:47:15,143][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:47:15,859][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:47:16,576][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:47:17,292][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:47:18,009][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:47:18,724][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:47:19,441][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:47:20,156][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:47:20,874][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:47:21,590][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:47:22,306][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:47:23,023][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:47:23,739][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:47:24,455][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:47:25,172][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:47:25,889][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:47:26,606][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:47:27,324][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:47:28,040][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:47:28,758][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:47:29,475][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:47:30,192][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:47:30,910][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:47:31,627][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:47:32,343][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:47:33,058][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:47:33,775][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:47:34,492][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:47:35,209][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:47:35,926][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:47:36,644][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:47:37,589][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:47:38,306][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:47:39,025][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:47:39,743][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:47:40,460][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:47:41,177][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:47:41,895][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:47:42,612][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:47:43,330][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:47:44,047][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:47:44,765][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:47:45,483][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:47:46,200][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:47:46,918][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:47:47,635][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:47:48,353][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:47:49,071][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:47:49,833][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 17:47:51,038][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:47:51,042][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:47:51,044][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:47:52,371][__main__][INFO] - Iteration 212 took 55s (9.26% Gen, 88.35% Train). Generation: 5s, Training: 49s. Estimated remaining time: 12h 5m 55s. Estimated total time: 15h 29m 13s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 55s, 500 more iterations: 7h 44m 36s. [2026-03-25 17:47:52,374][__main__][INFO] - Starting iteration 212. [2026-03-25 17:47:52,379][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:47:52,379][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:47:57,685][__main__][INFO] - Number of regex retries in iteration 212: 0 [2026-03-25 17:47:57,687][__main__][INFO] - agents played in iteration 212 are Bob, Alice [2026-03-25 17:47:58,257][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:47:58,322][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:47:58,323][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:47:58,324][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:47:59,011][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:47:59,658][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:48:00,379][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:48:01,096][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:48:01,811][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:48:02,526][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:48:03,243][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:48:03,958][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:48:04,674][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:48:05,389][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:48:06,107][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:48:06,822][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:48:07,537][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:48:08,253][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:48:08,972][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:48:09,687][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:48:10,405][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:48:11,120][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:48:11,837][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:48:12,552][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:48:13,270][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:48:13,985][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:48:14,703][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:48:15,419][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:48:16,137][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:48:16,853][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:48:17,571][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:48:18,289][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:48:19,006][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:48:19,723][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:48:20,441][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:48:21,157][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:48:21,874][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:48:22,591][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:48:23,309][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:48:24,026][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:48:24,743][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:48:25,460][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:48:26,178][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:48:26,895][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:48:27,614][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:48:28,330][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:48:29,049][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:48:29,766][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:48:30,483][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:48:31,201][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:48:31,918][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:48:32,636][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:48:33,591][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:48:34,311][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:48:35,027][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:48:35,746][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:48:36,464][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:48:37,181][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:48:37,899][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:48:38,617][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:48:39,338][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:48:40,054][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:48:40,772][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:48:41,491][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:48:42,208][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:48:42,926][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:48:43,645][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:48:44,361][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:48:45,080][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:48:45,823][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 17:48:47,124][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:48:47,128][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:48:47,130][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:48:49,394][__main__][INFO] - Iteration 213 took 57s (9.31% Gen, 86.71% Train). Generation: 5s, Training: 49s. Estimated remaining time: 12h 26m 3s. Estimated total time: 15h 50m 18s. Time estimates for 10 more iterations: 9m 30s, 100 more iterations: 1h 35m 1s, 500 more iterations: 7h 55m 9s. [2026-03-25 17:48:49,399][__main__][INFO] - Starting iteration 213. [2026-03-25 17:48:49,404][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:48:49,405][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:49:00,533][__main__][INFO] - Number of regex retries in iteration 213: 0 [2026-03-25 17:49:00,534][__main__][INFO] - agents played in iteration 213 are Bob, Alice [2026-03-25 17:49:01,069][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:49:01,134][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:49:01,135][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:49:01,136][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:49:01,823][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:49:02,467][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:49:03,183][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:49:03,895][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:49:04,608][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:49:05,323][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:49:06,037][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:49:06,751][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:49:07,465][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:49:08,179][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:49:08,893][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:49:09,609][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:49:10,322][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:49:11,034][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:49:11,751][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:49:12,466][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:49:13,182][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:49:13,898][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:49:14,612][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:49:15,327][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:49:16,043][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:49:16,760][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:49:17,477][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:49:18,193][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:49:18,909][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:49:19,624][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:49:20,341][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:49:21,056][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:49:21,771][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:49:22,488][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:49:23,205][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:49:23,921][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:49:24,639][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:49:25,354][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:49:26,071][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:49:26,787][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:49:27,506][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:49:28,222][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:49:28,939][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:49:29,658][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:49:30,373][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:49:31,093][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:49:31,808][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:49:32,525][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:49:33,240][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:49:33,956][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:49:34,672][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:49:35,389][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:49:36,425][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:49:37,143][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:49:37,858][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:49:38,576][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:49:39,293][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:49:40,011][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:49:40,727][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:49:41,443][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:49:42,160][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:49:42,876][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:49:43,593][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:49:44,309][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:49:45,027][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:49:45,743][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:49:46,464][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:49:47,179][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:49:47,899][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:49:48,641][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 17:49:49,892][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:49:49,896][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:49:49,898][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:49:51,173][__main__][INFO] - Iteration 214 took 1m 1s (18.01% Gen, 79.91% Train). Generation: 11s, Training: 49s. Estimated remaining time: 13h 44m 15s. Estimated total time: 17h 9m 32s. Time estimates for 10 more iterations: 10m 17s, 100 more iterations: 1h 42m 57s, 500 more iterations: 8h 34m 46s. [2026-03-25 17:49:51,176][__main__][INFO] - Starting iteration 214. [2026-03-25 17:49:51,181][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:49:51,182][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:49:56,573][__main__][INFO] - Number of regex retries in iteration 214: 0 [2026-03-25 17:49:56,574][__main__][INFO] - agents played in iteration 214 are Bob, Alice [2026-03-25 17:49:57,071][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:49:57,136][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:49:57,137][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:49:57,138][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:49:57,832][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:49:58,478][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:49:59,197][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:49:59,912][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:50:00,628][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:50:01,342][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:50:02,060][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:50:02,774][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:50:03,491][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:50:04,205][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:50:04,923][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:50:05,638][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:50:06,354][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:50:07,071][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:50:07,787][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:50:08,505][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:50:09,220][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:50:09,938][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:50:10,655][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:50:11,372][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:50:12,086][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:50:12,802][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:50:13,519][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:50:14,234][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:50:14,950][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:50:15,666][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:50:16,381][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:50:17,098][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:50:17,813][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:50:18,529][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:50:19,244][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:50:19,961][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:50:20,675][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:50:21,393][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:50:22,109][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:50:22,827][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:50:23,544][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:50:24,260][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:50:24,976][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:50:25,691][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:50:26,409][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:50:27,124][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:50:27,842][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:50:28,557][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:50:29,275][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:50:29,991][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:50:30,709][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:50:31,426][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:50:32,377][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:50:33,094][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:50:33,809][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:50:34,526][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:50:35,243][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:50:35,961][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:50:36,679][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:50:37,396][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:50:38,114][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:50:38,830][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:50:39,548][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:50:40,265][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:50:40,983][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:50:41,700][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:50:42,418][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:50:43,135][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:50:43,851][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:50:44,577][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 17:50:45,814][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:50:45,819][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:50:45,821][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:50:47,242][__main__][INFO] - Iteration 215 took 56s (9.62% Gen, 87.84% Train). Generation: 5s, Training: 49s. Estimated remaining time: 12h 8m 11s. Estimated total time: 15h 34m 24s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 26s, 500 more iterations: 7h 47m 12s. [2026-03-25 17:50:47,245][__main__][INFO] - Starting iteration 215. [2026-03-25 17:50:47,250][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:50:47,250][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:50:52,378][__main__][INFO] - Number of regex retries in iteration 215: 0 [2026-03-25 17:50:52,379][__main__][INFO] - agents played in iteration 215 are Bob, Alice [2026-03-25 17:50:52,880][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:50:52,945][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:50:52,946][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:50:52,947][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:50:53,641][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:50:54,287][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:50:55,005][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:50:55,719][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:50:56,435][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:50:57,150][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:50:57,867][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:50:58,582][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:50:59,300][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:51:00,015][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:51:00,733][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:51:01,449][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:51:02,166][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:51:02,881][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:51:03,597][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:51:04,315][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:51:05,032][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:51:05,748][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:51:06,466][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:51:07,181][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:51:07,898][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:51:08,614][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:51:09,333][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:51:10,048][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:51:10,767][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:51:11,483][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:51:12,201][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:51:12,919][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:51:13,638][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:51:14,354][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:51:15,070][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:51:15,790][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:51:16,507][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:51:17,225][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:51:17,942][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:51:18,657][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:51:19,375][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:51:20,092][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:51:20,808][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:51:21,525][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:51:22,242][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:51:22,957][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:51:23,674][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:51:24,391][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:51:25,107][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:51:25,826][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:51:26,541][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:51:27,260][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:51:28,205][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:51:28,922][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:51:29,640][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:51:30,356][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:51:31,073][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:51:31,791][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:51:32,507][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:51:33,226][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:51:33,942][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:51:34,660][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:51:35,377][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:51:36,095][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:51:36,814][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:51:37,530][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:51:38,249][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:51:38,966][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:51:39,684][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:51:40,420][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 17:51:41,573][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:51:41,576][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:51:41,578][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:51:42,846][__main__][INFO] - Iteration 216 took 55s (9.23% Gen, 88.49% Train). Generation: 5s, Training: 49s. Estimated remaining time: 11h 59m 29s. Estimated total time: 15h 26m 38s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 39s, 500 more iterations: 7h 43m 19s. [2026-03-25 17:51:42,849][__main__][INFO] - Starting iteration 216. [2026-03-25 17:51:42,853][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:51:42,898][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:51:48,139][__main__][INFO] - Number of regex retries in iteration 216: 0 [2026-03-25 17:51:48,141][__main__][INFO] - agents played in iteration 216 are Bob, Alice [2026-03-25 17:51:48,635][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:51:48,702][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:51:48,702][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:51:48,703][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:51:49,387][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:51:50,034][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:51:50,750][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:51:51,468][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:51:52,185][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:51:52,900][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:51:53,615][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:51:54,332][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:51:55,046][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:51:55,763][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:51:56,479][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:51:57,195][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:51:57,911][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:51:58,629][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:51:59,344][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:52:00,061][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:52:00,775][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:52:01,492][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:52:02,207][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:52:02,923][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:52:03,640][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:52:04,356][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:52:05,072][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:52:05,788][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:52:06,504][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:52:07,220][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:52:07,937][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:52:08,654][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:52:09,370][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:52:10,086][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:52:10,805][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:52:11,525][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:52:12,244][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:52:12,960][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:52:13,677][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:52:14,392][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:52:15,111][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:52:15,827][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:52:16,544][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:52:17,261][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:52:17,980][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:52:18,697][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:52:19,414][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:52:20,132][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:52:20,849][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:52:21,565][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:52:22,283][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:52:23,002][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:52:24,030][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:52:24,747][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:52:25,464][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:52:26,181][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:52:26,900][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:52:27,618][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:52:28,336][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:52:29,053][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:52:29,771][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:52:30,487][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:52:31,206][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:52:31,923][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:52:32,643][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:52:33,360][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:52:34,078][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:52:34,795][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:52:35,514][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:52:36,265][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 17:52:37,501][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:52:37,505][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:52:37,506][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:52:39,124][__main__][INFO] - Iteration 217 took 56s (9.32% Gen, 87.72% Train). Generation: 5s, Training: 49s. Estimated remaining time: 12h 9m 48s. Estimated total time: 15h 37m 53s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 47s, 500 more iterations: 7h 48m 56s. [2026-03-25 17:52:39,127][__main__][INFO] - Starting iteration 217. [2026-03-25 17:52:39,131][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:52:39,132][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:52:46,387][__main__][INFO] - Number of regex retries in iteration 217: 0 [2026-03-25 17:52:46,389][__main__][INFO] - agents played in iteration 217 are Bob, Alice [2026-03-25 17:52:46,887][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:52:46,955][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:52:46,956][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:52:46,957][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:52:47,647][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:52:48,291][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:52:49,010][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:52:49,724][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:52:50,441][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:52:51,155][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:52:51,871][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:52:52,586][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:52:53,302][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:52:54,016][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:52:54,733][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:52:55,448][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:52:56,164][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:52:56,880][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:52:57,596][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:52:58,312][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:52:59,028][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:52:59,744][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:53:00,460][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:53:01,178][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:53:01,895][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:53:02,611][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:53:03,329][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:53:04,044][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:53:04,761][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:53:05,477][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:53:06,194][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:53:06,910][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:53:07,627][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:53:08,343][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:53:09,062][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:53:09,777][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:53:10,493][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:53:11,208][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:53:11,925][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:53:12,640][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:53:13,358][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:53:14,074][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:53:14,790][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:53:15,508][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:53:16,223][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:53:16,941][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:53:17,657][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:53:18,377][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:53:19,091][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:53:19,809][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:53:20,524][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:53:21,241][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:53:22,186][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:53:22,903][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:53:23,620][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:53:24,338][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:53:25,055][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:53:25,772][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:53:26,490][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:53:27,208][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:53:27,926][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:53:28,643][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:53:29,360][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:53:30,077][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:53:30,795][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:53:31,513][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:53:32,231][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:53:32,948][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:53:33,666][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:53:34,404][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 17:53:35,646][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:53:35,649][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:53:35,651][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:53:38,742][__main__][INFO] - Iteration 218 took 59s (12.17% Gen, 82.64% Train). Generation: 7s, Training: 49s. Estimated remaining time: 13h 4m 29s. Estimated total time: 16h 33m 33s. Time estimates for 10 more iterations: 9m 56s, 100 more iterations: 1h 39m 21s, 500 more iterations: 8h 16m 46s. [2026-03-25 17:53:38,746][__main__][INFO] - Starting iteration 218. [2026-03-25 17:53:38,751][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:53:38,752][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:53:48,093][__main__][INFO] - Number of regex retries in iteration 218: 0 [2026-03-25 17:53:48,094][__main__][INFO] - agents played in iteration 218 are Bob, Alice [2026-03-25 17:53:48,583][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:53:48,650][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:53:48,651][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:53:48,652][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:53:49,372][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:53:50,017][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:53:50,732][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:53:51,448][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:53:52,161][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:53:52,877][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:53:53,591][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:53:54,306][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:53:55,022][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:53:55,739][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:53:56,456][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:53:57,172][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:53:57,887][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:53:58,603][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:53:59,319][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:54:00,036][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:54:00,751][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:54:01,467][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:54:02,181][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:54:02,895][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:54:03,612][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:54:04,327][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:54:05,042][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:54:05,758][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:54:06,472][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:54:07,188][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:54:07,901][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:54:08,618][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:54:09,333][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:54:10,051][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:54:10,766][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:54:11,484][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:54:12,198][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:54:12,915][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:54:13,631][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:54:14,349][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:54:15,064][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:54:15,782][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:54:16,499][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:54:17,216][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:54:17,934][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:54:18,651][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:54:19,366][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:54:20,083][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:54:20,801][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:54:21,515][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:54:22,231][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:54:22,948][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:54:23,904][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:54:24,621][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:54:25,336][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:54:26,053][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:54:26,772][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:54:27,489][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:54:28,206][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:54:28,924][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:54:29,639][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:54:30,357][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:54:31,073][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:54:31,791][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:54:32,509][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:54:33,226][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:54:33,943][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:54:34,661][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:54:35,377][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:54:36,134][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 17:54:37,214][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:54:37,218][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:54:37,220][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:54:38,516][__main__][INFO] - Iteration 219 took 59s (15.63% Gen, 82.19% Train). Generation: 9s, Training: 49s. Estimated remaining time: 13h 6m 4s. Estimated total time: 16h 36m 8s. Time estimates for 10 more iterations: 9m 57s, 100 more iterations: 1h 39m 36s, 500 more iterations: 8h 18m 4s. [2026-03-25 17:54:38,519][__main__][INFO] - Starting iteration 219. [2026-03-25 17:54:38,524][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:54:38,524][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:54:43,601][__main__][INFO] - Number of regex retries in iteration 219: 0 [2026-03-25 17:54:43,602][__main__][INFO] - agents played in iteration 219 are Bob, Alice [2026-03-25 17:54:44,172][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:54:44,238][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:54:44,238][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:54:44,239][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:54:44,926][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:54:45,571][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:54:46,289][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:54:47,002][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:54:47,718][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:54:48,433][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:54:49,148][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:54:49,863][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:54:50,578][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:54:51,295][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:54:52,010][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:54:52,728][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:54:53,445][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:54:54,161][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:54:54,876][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:54:55,594][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:54:56,308][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:54:57,030][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:54:57,748][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:54:58,464][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:54:59,182][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:54:59,898][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:55:00,615][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:55:01,331][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:55:02,053][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:55:02,768][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:55:03,485][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:55:04,200][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:55:04,917][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:55:05,632][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:55:06,348][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:55:07,065][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:55:07,782][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:55:08,499][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:55:09,215][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:55:09,933][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:55:10,650][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:55:11,368][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:55:12,084][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:55:12,802][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:55:13,518][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:55:14,236][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:55:14,952][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:55:15,669][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:55:16,385][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:55:17,102][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:55:17,819][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:55:18,535][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:55:19,552][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:55:20,267][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:55:20,984][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:55:21,701][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:55:22,417][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:55:23,134][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:55:23,854][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:55:24,568][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:55:25,287][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:55:26,005][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:55:26,723][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:55:27,442][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:55:28,160][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:55:28,877][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:55:29,594][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:55:30,313][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:55:31,031][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:55:31,802][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 17:55:32,939][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:55:32,944][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:55:32,946][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:55:38,452][__main__][INFO] - Iteration 220 took 59s (8.47% Gen, 82.33% Train). Generation: 5s, Training: 49s. Estimated remaining time: 13h 7m 46s. Estimated total time: 16h 38m 50s. Time estimates for 10 more iterations: 9m 59s, 100 more iterations: 1h 39m 53s, 500 more iterations: 8h 19m 25s. [2026-03-25 17:55:38,454][__main__][INFO] - Starting iteration 220. [2026-03-25 17:55:38,459][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:55:38,459][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:55:43,743][__main__][INFO] - Number of regex retries in iteration 220: 0 [2026-03-25 17:55:43,744][__main__][INFO] - agents played in iteration 220 are Bob, Alice [2026-03-25 17:55:44,279][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:55:44,345][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:55:44,346][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:55:44,347][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:55:45,034][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:55:45,679][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:55:46,397][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:55:47,112][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:55:47,827][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:55:48,542][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:55:49,259][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:55:49,972][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:55:50,688][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:55:51,404][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:55:52,122][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:55:52,839][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:55:53,555][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:55:54,272][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:55:54,988][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:55:55,704][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:55:56,420][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:55:57,137][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:55:57,853][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:55:58,569][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:55:59,285][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:56:00,003][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:56:00,718][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:56:01,435][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:56:02,150][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:56:02,868][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:56:03,584][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:56:04,301][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:56:05,017][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:56:05,737][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:56:06,453][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:56:07,171][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:56:07,888][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:56:08,604][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:56:09,325][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:56:10,041][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:56:10,760][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:56:11,476][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:56:12,195][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:56:12,912][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:56:13,630][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:56:14,348][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:56:15,064][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:56:15,781][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:56:16,498][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:56:17,215][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:56:17,933][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:56:18,650][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:56:19,603][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:56:20,323][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:56:21,037][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:56:21,755][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:56:22,473][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:56:23,190][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:56:23,910][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:56:24,626][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:56:25,344][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:56:26,060][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:56:26,778][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:56:27,497][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:56:28,214][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:56:28,935][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:56:29,654][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:56:30,373][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:56:31,093][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:56:31,846][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 17:56:32,935][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:56:32,939][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:56:32,941][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:56:34,499][__main__][INFO] - Iteration 221 took 56s (9.43% Gen, 87.78% Train). Generation: 5s, Training: 49s. Estimated remaining time: 12h 2m 2s. Estimated total time: 15h 34m 3s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 24s, 500 more iterations: 7h 47m 1s. [2026-03-25 17:56:34,503][__main__][INFO] - Starting iteration 221. [2026-03-25 17:56:34,507][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:56:34,508][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:56:39,653][__main__][INFO] - Number of regex retries in iteration 221: 0 [2026-03-25 17:56:39,654][__main__][INFO] - agents played in iteration 221 are Bob, Alice [2026-03-25 17:56:40,155][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:56:40,221][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:56:40,222][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:56:40,223][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:56:40,923][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:56:41,569][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:56:42,289][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:56:43,004][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:56:43,719][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:56:44,436][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:56:45,151][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:56:45,870][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:56:46,586][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:56:47,305][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:56:48,022][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:56:48,740][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:56:49,455][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:56:50,173][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:56:50,890][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:56:51,608][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:56:52,324][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:56:53,043][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:56:53,760][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:56:54,478][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:56:55,195][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:56:55,913][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:56:56,628][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:56:57,347][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:56:58,063][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:56:58,781][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:56:59,501][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:57:00,218][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:57:00,938][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:57:01,657][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:57:02,376][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:57:03,095][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:57:03,814][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:57:04,534][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:57:05,253][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:57:05,972][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:57:06,691][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:57:07,410][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:57:08,131][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:57:08,850][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:57:09,570][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:57:10,291][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:57:11,010][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:57:11,730][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:57:12,451][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:57:13,171][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:57:13,890][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:57:14,611][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:57:15,584][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:57:16,306][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:57:17,025][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:57:17,741][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:57:18,461][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:57:19,179][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:57:19,896][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:57:20,614][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:57:21,331][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:57:22,050][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:57:22,769][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:57:23,486][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:57:24,206][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:57:24,925][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:57:25,643][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:57:26,362][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:57:27,080][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:57:27,825][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 17:57:28,907][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:57:28,910][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:57:28,912][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:57:30,347][__main__][INFO] - Iteration 222 took 55s (9.21% Gen, 88.21% Train). Generation: 5s, Training: 49s. Estimated remaining time: 11h 57m 46s. Estimated total time: 15h 30m 42s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 4s, 500 more iterations: 7h 45m 21s. [2026-03-25 17:57:30,350][__main__][INFO] - Starting iteration 222. [2026-03-25 17:57:30,355][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:57:30,356][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:57:35,513][__main__][INFO] - Number of regex retries in iteration 222: 0 [2026-03-25 17:57:35,514][__main__][INFO] - agents played in iteration 222 are Bob, Alice [2026-03-25 17:57:36,021][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:57:36,088][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:57:36,089][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:57:36,090][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:57:36,782][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:57:37,429][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:57:38,147][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:57:38,863][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:57:39,581][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:57:40,296][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:57:41,012][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:57:41,730][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:57:42,447][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:57:43,163][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:57:43,881][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:57:44,597][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:57:45,313][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:57:46,030][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:57:46,746][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:57:47,462][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:57:48,178][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:57:48,894][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:57:49,613][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:57:50,330][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:57:51,050][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:57:52,054][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:57:52,771][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:57:53,490][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:57:54,207][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:57:54,926][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:57:55,643][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:57:56,361][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:58:00,678][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:58:02,098][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:58:02,814][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:58:03,531][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:58:04,246][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:58:04,963][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:58:05,681][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:58:06,398][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:58:07,114][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:58:07,830][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:58:08,546][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:58:09,263][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:58:09,983][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:58:10,699][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:58:11,419][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:58:12,136][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:58:12,851][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:58:13,567][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:58:14,285][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:58:15,002][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:58:16,008][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:58:16,725][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:58:17,441][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:58:18,157][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:58:18,875][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:58:19,591][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:58:20,307][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:58:21,023][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:58:21,741][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:58:22,457][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:58:23,174][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:58:23,891][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:58:24,608][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:58:25,325][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:58:26,041][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:58:26,759][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:58:27,475][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:58:28,249][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:51 [2026-03-25 17:58:29,237][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:58:29,240][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:58:29,241][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:58:31,258][__main__][INFO] - Iteration 223 took 1m 0s (8.47% Gen, 88.21% Train). Generation: 5s, Training: 53s. Estimated remaining time: 13h 21m 8s. Estimated total time: 16h 55m 5s. Time estimates for 10 more iterations: 10m 9s, 100 more iterations: 1h 41m 30s, 500 more iterations: 8h 27m 32s. [2026-03-25 17:58:31,261][__main__][INFO] - Starting iteration 223. [2026-03-25 17:58:31,266][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:58:31,267][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:58:36,545][__main__][INFO] - Number of regex retries in iteration 223: 0 [2026-03-25 17:58:36,546][__main__][INFO] - agents played in iteration 223 are Bob, Alice [2026-03-25 17:58:37,043][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:58:37,109][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:58:37,111][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:58:37,111][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:58:37,804][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:58:38,449][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:58:39,167][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:58:39,880][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:58:40,597][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:58:41,312][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:58:42,030][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:58:42,744][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:58:43,461][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:58:44,176][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:58:44,892][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:58:45,608][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:58:46,323][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:58:47,039][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:58:47,757][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:58:48,473][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:58:49,189][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:58:49,904][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:58:50,622][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:58:51,337][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:58:52,054][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:58:52,771][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:58:53,489][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:58:54,204][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:58:54,922][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:58:55,638][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:58:56,361][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:58:57,078][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:58:57,794][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:58:58,509][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:58:59,225][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:58:59,941][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:59:00,660][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:59:01,375][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:59:02,092][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:59:02,810][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:59:03,526][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:59:04,241][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:59:04,960][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:59:05,677][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:59:06,395][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:59:07,111][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:59:07,830][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:59:08,547][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:59:09,266][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:59:09,982][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:59:10,699][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:59:11,415][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:59:12,370][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:59:13,088][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:59:13,805][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:59:14,521][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
17:59:15,240][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:59:15,957][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:59:16,673][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:59:17,391][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:59:18,110][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:59:18,828][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:59:19,547][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:59:20,264][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:59:20,982][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:59:21,699][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:59:22,418][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:59:23,137][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:59:23,855][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 17:59:24,609][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 17:59:25,572][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:59:25,574][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:59:25,575][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:59:27,025][__main__][INFO] - Iteration 224 took 55s (9.47% Gen, 87.93% Train). Generation: 5s, Training: 49s. Estimated remaining time: 11h 54m 29s. Estimated total time: 15h 29m 22s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 56s, 500 more iterations: 7h 44m 41s. [2026-03-25 17:59:27,028][__main__][INFO] - Starting iteration 224. [2026-03-25 17:59:27,033][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 17:59:27,034][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:59:32,554][__main__][INFO] - Number of regex retries in iteration 224: 0 [2026-03-25 17:59:32,555][__main__][INFO] - agents played in iteration 224 are Bob, Alice [2026-03-25 17:59:33,049][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:59:33,116][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 17:59:33,116][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 17:59:33,117][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 17:59:33,815][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:59:34,462][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:59:35,183][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:59:35,898][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:59:36,616][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:59:37,331][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:59:38,050][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:59:38,766][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:59:39,484][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:59:40,199][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:59:40,916][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 17:59:41,631][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:59:42,349][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:59:43,065][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:59:43,782][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:59:44,498][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:59:45,214][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:59:45,930][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:59:46,648][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:59:47,365][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:59:48,081][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:59:48,798][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:59:49,513][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:59:50,230][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:59:50,946][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:59:51,665][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:59:52,380][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:59:53,097][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:59:53,813][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:59:54,530][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:59:55,250][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:59:55,965][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 17:59:56,684][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:59:57,402][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:59:58,119][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:59:58,835][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:59:59,553][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:00:00,268][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:00:00,987][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:00:01,704][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:00:02,421][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:00:03,137][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:00:03,855][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:00:04,571][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:00:05,289][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:00:06,007][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:00:06,726][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:00:07,443][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:00:08,394][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:00:09,114][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:00:09,834][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:00:10,553][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:00:11,273][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:00:11,994][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:00:12,713][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:00:13,433][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:00:14,155][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:00:14,873][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:00:15,593][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:00:16,314][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:00:17,032][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:00:17,755][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:00:18,476][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:00:19,196][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:00:19,918][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:00:20,679][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 18:00:21,703][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:00:21,707][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:00:21,709][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:00:23,145][__main__][INFO] - Iteration 225 took 56s (9.84% Gen, 87.59% Train). Generation: 5s, Training: 49s. Estimated remaining time: 11h 59m 26s. Estimated total time: 15h 35m 15s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 31s, 500 more iterations: 7h 47m 37s. [2026-03-25 18:00:23,148][__main__][INFO] - Starting iteration 225. [2026-03-25 18:00:23,153][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 18:00:23,153][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:00:28,657][__main__][INFO] - Number of regex retries in iteration 225: 0 [2026-03-25 18:00:28,658][__main__][INFO] - agents played in iteration 225 are Bob, Alice [2026-03-25 18:00:29,149][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:00:29,215][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:00:29,216][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:00:29,216][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:00:29,909][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:00:30,555][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:00:31,273][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:00:31,989][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:00:32,703][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:00:33,420][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:00:34,137][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:00:34,853][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:00:35,569][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:00:36,285][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:00:37,001][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:00:37,718][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:00:38,436][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:00:39,152][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:00:39,869][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:00:40,585][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:00:41,301][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:00:42,017][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:00:42,735][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:00:43,451][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:00:44,169][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:00:44,885][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:00:45,601][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:00:46,317][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:00:47,034][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:00:47,751][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:00:48,468][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:00:49,185][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:00:49,902][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:00:50,618][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:00:51,334][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:00:52,052][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:00:52,769][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:00:53,487][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:00:54,203][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:00:54,921][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:00:55,636][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:00:56,355][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:00:57,070][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:00:57,790][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:00:58,507][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:00:59,224][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:00:59,942][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:01:00,660][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:01:01,379][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:01:02,095][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:01:02,813][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:01:03,531][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:01:04,543][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:01:05,261][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:01:05,981][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:01:06,700][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:01:07,418][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:01:08,135][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:01:08,854][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:01:09,572][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:01:10,291][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:01:11,008][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:01:11,727][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:01:12,445][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:01:13,162][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:01:13,879][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:01:14,596][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:01:15,315][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:01:16,032][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:01:16,778][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 18:01:17,775][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:01:17,778][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:01:17,779][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:01:19,172][__main__][INFO] - Iteration 226 took 56s (9.83% Gen, 87.68% Train). Generation: 5s, Training: 49s. Estimated remaining time: 11h 56m 56s. Estimated total time: 15h 33m 41s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 22s, 500 more iterations: 7h 46m 50s. [2026-03-25 18:01:19,175][__main__][INFO] - Starting iteration 226. [2026-03-25 18:01:19,179][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 18:01:19,179][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:01:24,256][__main__][INFO] - Number of regex retries in iteration 226: 0 [2026-03-25 18:01:24,257][__main__][INFO] - agents played in iteration 226 are Bob, Alice [2026-03-25 18:01:24,777][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:01:24,842][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:01:24,842][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:01:24,843][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:01:25,526][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:01:26,173][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:01:26,892][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:01:27,607][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:01:28,324][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:01:29,040][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:01:29,758][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:01:30,474][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:01:31,193][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:01:31,910][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:01:32,628][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:01:33,345][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:01:34,063][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:01:34,780][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:01:35,497][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:01:36,215][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:01:36,932][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:01:37,649][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:01:38,367][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:01:39,086][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:01:39,802][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:01:40,520][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:01:41,237][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:01:41,952][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:01:42,670][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:01:43,386][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:01:44,104][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:01:44,820][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:01:45,537][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:01:46,252][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:01:46,970][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:01:47,685][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:01:48,402][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:01:49,118][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:01:49,835][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:01:50,552][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:01:51,268][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:01:51,984][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:01:52,701][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:01:53,419][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:01:54,134][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:01:54,853][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:01:55,569][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:01:56,285][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:01:57,002][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:01:57,719][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:01:58,436][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:01:59,152][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:02:00,131][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:02:00,850][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:02:01,570][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:02:02,285][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:02:03,004][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:02:03,720][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:02:04,438][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:02:05,154][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:02:05,871][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:02:06,591][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:02:07,306][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:02:09,492][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:02:10,613][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:02:11,330][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:02:12,046][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:02:12,763][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:02:13,657][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:02:14,402][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:48 [2026-03-25 18:02:15,540][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:02:15,544][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:02:15,546][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:02:17,079][__main__][INFO] - Iteration 227 took 57s (8.77% Gen, 88.58% Train). Generation: 5s, Training: 51s. Estimated remaining time: 12h 27m 19s. Estimated total time: 16h 5m 2s. Time estimates for 10 more iterations: 9m 39s, 100 more iterations: 1h 36m 30s, 500 more iterations: 8h 2m 31s. [2026-03-25 18:02:17,082][__main__][INFO] - Starting iteration 227. [2026-03-25 18:02:17,087][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 18:02:17,087][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:02:22,728][__main__][INFO] - Number of regex retries in iteration 227: 0 [2026-03-25 18:02:22,729][__main__][INFO] - agents played in iteration 227 are Bob, Alice [2026-03-25 18:02:23,296][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:02:23,361][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:02:23,362][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:02:23,363][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:02:24,049][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:02:24,694][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:02:25,411][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:02:26,124][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:02:26,840][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:02:27,555][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:02:28,271][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:02:28,985][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:02:29,700][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:02:30,414][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:02:31,131][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:02:31,846][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:02:32,560][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:02:33,276][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:02:33,992][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:02:34,707][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:02:35,424][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:02:36,138][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:02:36,855][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:02:37,571][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:02:38,287][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:02:39,004][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:02:39,719][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:02:40,436][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:02:41,152][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:02:41,870][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:02:42,586][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:02:43,302][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:02:44,018][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:02:44,736][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:02:45,451][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:02:46,169][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:02:46,886][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:02:47,603][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:02:48,320][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:02:49,035][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:02:49,751][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:02:50,466][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:02:51,183][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:02:51,898][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:02:52,615][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:02:53,331][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:02:54,048][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:02:54,764][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:02:55,481][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:02:56,197][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:02:56,914][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:02:57,630][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:02:58,577][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:02:59,293][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:03:00,009][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:03:00,726][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:03:01,442][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:03:02,160][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:03:02,877][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:03:03,597][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:03:04,316][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:03:05,031][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:03:05,751][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:03:06,467][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:03:07,183][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:03:07,902][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:03:08,619][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:03:09,339][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:03:10,054][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:03:10,780][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 18:03:11,762][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:03:11,765][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:03:11,768][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:03:13,237][__main__][INFO] - Iteration 228 took 56s (10.05% Gen, 87.33% Train). Generation: 5s, Training: 49s. Estimated remaining time: 11h 57m 13s. Estimated total time: 15h 35m 52s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 35s, 500 more iterations: 7h 47m 56s. [2026-03-25 18:03:13,241][__main__][INFO] - Starting iteration 228. [2026-03-25 18:03:13,248][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 18:03:13,249][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:03:22,627][__main__][INFO] - Number of regex retries in iteration 228: 0 [2026-03-25 18:03:22,628][__main__][INFO] - agents played in iteration 228 are Bob, Alice [2026-03-25 18:03:23,163][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:03:23,227][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:03:23,228][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:03:23,229][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:03:23,969][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:03:24,615][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:03:25,330][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:03:26,043][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:03:26,760][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:03:27,472][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:03:28,187][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:03:28,903][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:03:29,617][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:03:30,332][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:03:31,047][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:03:31,762][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:03:32,477][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:03:33,192][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:03:33,909][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:03:34,623][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:03:35,339][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:03:36,053][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:03:36,770][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:03:37,487][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:03:38,201][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:03:38,922][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:03:39,634][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:03:40,349][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:03:41,066][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:03:41,781][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:03:42,498][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:03:43,215][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:03:43,931][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:03:44,649][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:03:45,364][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:03:46,082][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:03:46,797][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:03:47,515][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:03:48,230][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:03:48,948][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:03:49,664][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:03:50,380][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:03:51,097][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:03:51,816][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:03:52,535][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:03:53,251][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:03:53,973][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:03:54,690][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:03:55,407][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:03:56,122][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:03:56,840][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:03:57,557][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:03:58,601][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:03:59,321][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:04:00,039][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:04:00,754][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:04:01,470][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:04:02,186][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:04:02,902][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:04:03,619][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:04:04,338][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:04:05,054][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:04:05,771][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:04:06,489][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:04:07,206][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:04:07,923][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:04:08,641][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:04:09,361][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:04:10,078][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:04:10,825][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 18:04:11,835][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:04:11,838][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:04:11,839][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:04:13,306][__main__][INFO] - Iteration 229 took 1m 0s (15.61% Gen, 81.94% Train). Generation: 9s, Training: 49s. Estimated remaining time: 13h 1m 22s. Estimated total time: 16h 41m 1s. Time estimates for 10 more iterations: 10m 0s, 100 more iterations: 1h 40m 6s, 500 more iterations: 8h 20m 30s. [2026-03-25 18:04:13,309][__main__][INFO] - Starting iteration 229. [2026-03-25 18:04:13,313][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 18:04:13,314][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:04:18,392][__main__][INFO] - Number of regex retries in iteration 229: 0 [2026-03-25 18:04:18,393][__main__][INFO] - agents played in iteration 229 are Bob, Alice [2026-03-25 18:04:18,880][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:04:18,946][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:04:18,947][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:04:18,947][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:04:19,648][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:04:20,296][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:04:21,012][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:04:21,728][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:04:22,443][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:04:23,160][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:04:23,876][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:04:24,591][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:04:25,307][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:04:26,023][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:04:26,739][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:04:27,456][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:04:28,172][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:04:28,892][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:04:29,608][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:04:30,325][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:04:31,044][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:04:31,761][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:04:32,480][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:04:33,197][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:04:33,915][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:04:34,632][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:04:35,350][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:04:36,066][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:04:36,786][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:04:37,504][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:04:38,220][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:04:38,938][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:04:39,656][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:04:40,378][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:04:41,093][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:04:41,812][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:04:42,528][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:04:43,246][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:04:43,962][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:04:44,680][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:04:45,397][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:04:46,116][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:04:46,834][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:04:47,552][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:04:48,269][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:04:48,989][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:04:49,707][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:04:50,426][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:04:51,143][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:04:51,860][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:04:52,578][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:04:53,294][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:04:54,271][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:04:54,989][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:04:56,612][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:04:57,330][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:04:58,046][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:04:58,765][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:04:59,483][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:05:00,200][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:05:00,917][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:05:01,632][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:05:02,351][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:05:03,067][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:05:03,786][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:05:04,503][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:05:05,220][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:05:05,938][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:05:06,654][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:05:07,396][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 18:05:08,396][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:05:08,400][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:05:08,401][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:05:10,361][__main__][INFO] - Iteration 230 took 57s (8.90% Gen, 87.66% Train). Generation: 5s, Training: 50s. Estimated remaining time: 12h 10m 14s. Estimated total time: 15h 50m 50s. Time estimates for 10 more iterations: 9m 30s, 100 more iterations: 1h 35m 5s, 500 more iterations: 7h 55m 25s. [2026-03-25 18:05:10,364][__main__][INFO] - Starting iteration 230. [2026-03-25 18:05:10,368][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 18:05:10,369][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:05:15,892][__main__][INFO] - Number of regex retries in iteration 230: 0 [2026-03-25 18:05:15,893][__main__][INFO] - agents played in iteration 230 are Bob, Alice [2026-03-25 18:05:16,377][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:05:16,442][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:05:16,443][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:05:16,444][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:05:17,142][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:05:17,792][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:05:18,509][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:05:19,227][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:05:19,945][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:05:20,663][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:05:21,378][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:05:22,097][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:05:22,814][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:05:23,531][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:05:24,248][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:05:24,963][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:05:25,682][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:05:26,397][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:05:27,115][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:05:27,831][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:05:28,549][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:05:29,266][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:05:29,982][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:05:30,698][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:05:31,414][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:05:32,131][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:05:32,849][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:05:33,564][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:05:34,281][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:05:34,998][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:05:35,715][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:05:36,432][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:05:37,148][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:05:37,865][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:05:38,582][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:05:39,301][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:05:40,019][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:05:40,737][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:05:41,454][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:05:42,172][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:05:42,890][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:05:43,608][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:05:44,327][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:05:45,045][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:05:45,764][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:05:46,482][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:05:47,202][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:05:47,921][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:05:48,639][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:05:49,359][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:05:50,076][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:05:50,794][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:05:51,746][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:05:52,464][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:05:53,181][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:05:53,898][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:05:54,615][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:05:55,333][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:05:56,050][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:05:56,767][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:05:57,485][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:05:58,202][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:05:58,919][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:05:59,637][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:06:00,358][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:06:01,079][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:06:01,798][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:06:02,516][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:06:03,235][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:06:03,976][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 18:06:05,095][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:06:05,099][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:06:05,101][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:06:06,467][__main__][INFO] - Iteration 231 took 56s (9.85% Gen, 87.71% Train). Generation: 5s, Training: 49s. Estimated remaining time: 11h 53m 29s. Estimated total time: 15h 35m 1s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 30s, 500 more iterations: 7h 47m 30s. [2026-03-25 18:06:06,470][__main__][INFO] - Starting iteration 231. [2026-03-25 18:06:06,475][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 18:06:06,476][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:06:11,846][__main__][INFO] - Number of regex retries in iteration 231: 0 [2026-03-25 18:06:11,847][__main__][INFO] - agents played in iteration 231 are Bob, Alice [2026-03-25 18:06:12,360][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:06:12,428][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:06:12,428][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:06:12,429][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:06:13,185][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:06:13,833][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:06:14,555][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:06:15,272][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:06:15,991][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:06:16,707][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:06:17,424][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:06:18,139][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:06:18,860][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:06:19,576][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:06:20,293][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:06:21,014][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:06:21,732][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:06:22,450][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:06:23,167][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:06:23,886][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:06:24,604][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:06:25,323][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:06:26,043][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:06:26,761][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:06:27,479][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:06:28,198][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:06:28,915][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:06:29,635][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:06:30,354][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:06:31,074][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:06:31,795][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:06:32,514][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:06:33,232][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:06:33,952][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:06:34,671][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:06:35,390][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:06:36,110][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:06:36,827][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:06:37,545][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:06:38,265][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:06:38,983][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:06:39,703][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:06:40,424][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:06:41,141][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:06:41,861][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:06:42,579][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:06:43,295][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:06:44,012][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:06:44,729][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:06:45,448][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:06:46,165][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:06:46,884][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:06:47,900][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:06:48,618][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:06:49,335][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:06:50,053][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:06:50,770][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:06:51,489][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:06:52,205][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:06:52,924][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:06:53,642][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:06:54,361][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:06:55,078][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:06:55,795][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:06:56,514][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:06:57,232][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:06:57,952][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:06:58,672][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:06:59,391][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:07:00,130][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 18:07:01,278][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:07:01,281][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:07:01,283][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:07:02,967][__main__][INFO] - Iteration 232 took 56s (9.51% Gen, 87.50% Train). Generation: 5s, Training: 49s. Estimated remaining time: 11h 59m 6s. Estimated total time: 15h 41m 34s. Time estimates for 10 more iterations: 9m 24s, 100 more iterations: 1h 34m 9s, 500 more iterations: 7h 50m 47s. [2026-03-25 18:07:02,970][__main__][INFO] - Starting iteration 232. [2026-03-25 18:07:02,974][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 18:07:02,975][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:07:08,193][__main__][INFO] - Number of regex retries in iteration 232: 0 [2026-03-25 18:07:08,194][__main__][INFO] - agents played in iteration 232 are Bob, Alice [2026-03-25 18:07:08,685][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:07:08,752][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:07:08,753][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:07:08,754][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:07:09,461][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:07:10,107][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:07:10,825][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:07:11,542][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:07:12,260][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:07:12,977][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:07:13,693][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:07:14,412][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:07:15,130][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:07:15,845][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:07:16,565][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:07:17,284][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:07:18,002][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:07:18,720][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:07:19,440][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:07:20,157][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:07:20,874][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:07:21,591][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:07:22,310][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:07:23,027][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:07:23,745][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:07:24,462][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:07:25,180][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:07:25,899][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:07:26,616][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:07:27,336][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:07:28,053][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:07:28,771][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:07:29,488][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:07:30,204][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:07:30,921][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:07:31,637][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:07:32,355][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:07:33,073][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:07:33,791][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:07:34,509][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:07:35,225][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:07:35,943][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:07:36,659][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:07:37,379][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:07:38,097][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:07:38,816][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:07:39,534][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:07:40,252][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:07:40,969][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:07:41,688][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:07:42,408][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:07:43,126][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:07:44,100][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:07:44,819][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:07:45,536][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:07:46,256][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:07:46,973][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:07:47,694][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:07:48,412][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:07:49,132][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:07:49,851][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:07:50,568][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:07:51,286][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:07:52,007][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:07:52,724][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:07:53,443][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:07:54,161][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:07:54,879][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:07:55,598][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:07:56,334][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 18:07:57,392][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:07:57,395][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:07:57,397][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:07:58,753][__main__][INFO] - Iteration 233 took 55s (9.36% Gen, 88.20% Train). Generation: 5s, Training: 49s. Estimated remaining time: 11h 46m 16s. Estimated total time: 15h 29m 41s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 58s, 500 more iterations: 7h 44m 50s. [2026-03-25 18:07:58,758][__main__][INFO] - Starting iteration 233. [2026-03-25 18:07:58,762][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 18:07:58,763][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:08:04,700][__main__][INFO] - Number of regex retries in iteration 233: 0 [2026-03-25 18:08:04,702][__main__][INFO] - agents played in iteration 233 are Bob, Alice [2026-03-25 18:08:05,191][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:08:05,256][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:08:05,257][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:08:05,258][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:08:05,949][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:08:06,595][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:08:07,314][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:08:08,029][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:08:08,749][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:08:09,464][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:08:10,183][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:08:10,898][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:08:11,616][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:08:12,333][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:08:13,050][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:08:13,768][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:08:14,490][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:08:15,208][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:08:15,926][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:08:16,645][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:08:17,362][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:08:18,079][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:08:18,796][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:08:19,515][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:08:20,233][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:08:20,950][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:08:21,670][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:08:22,388][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:08:23,107][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:08:23,829][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:08:24,546][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:08:25,263][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:08:25,982][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:08:26,699][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:08:27,418][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:08:28,135][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:08:28,852][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:08:29,570][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:08:30,287][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:08:31,004][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:08:31,722][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:08:32,439][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:08:33,156][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:08:33,873][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:08:34,590][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:08:35,306][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:08:36,025][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:08:36,742][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:08:37,461][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:08:38,179][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:08:38,898][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:08:39,616][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:08:40,570][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:08:41,289][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:08:42,007][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:08:42,725][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:08:43,442][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:08:44,162][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:08:44,880][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:08:45,598][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:08:46,317][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:08:47,036][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:08:47,754][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:08:48,473][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:08:49,191][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:08:49,909][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:08:50,629][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:08:51,347][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:08:52,067][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:08:52,799][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 18:08:53,901][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:08:53,909][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:08:53,912][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:08:55,514][__main__][INFO] - Iteration 234 took 56s (10.47% Gen, 86.71% Train). Generation: 5s, Training: 49s. Estimated remaining time: 12h 1m 32s. Estimated total time: 15h 45m 54s. Time estimates for 10 more iterations: 9m 27s, 100 more iterations: 1h 34m 35s, 500 more iterations: 7h 52m 57s. [2026-03-25 18:08:55,518][__main__][INFO] - Starting iteration 234. [2026-03-25 18:08:55,524][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 18:08:55,524][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:09:00,792][__main__][INFO] - Number of regex retries in iteration 234: 0 [2026-03-25 18:09:00,794][__main__][INFO] - agents played in iteration 234 are Bob, Alice [2026-03-25 18:09:01,333][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:09:01,399][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:09:01,400][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:09:01,401][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:09:02,095][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:09:02,743][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:09:03,463][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:09:04,179][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:09:04,895][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:09:05,612][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:09:06,330][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:09:07,048][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:09:07,765][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:09:08,482][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:09:09,202][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:09:09,920][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:09:10,637][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:09:11,354][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:09:12,073][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:09:12,790][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:09:13,508][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:09:14,230][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:09:14,946][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:09:15,664][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:09:16,382][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:09:17,099][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:09:17,817][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:09:18,533][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:09:19,252][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:09:19,968][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:09:20,687][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:09:21,403][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:09:22,120][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:09:22,837][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:09:23,556][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:09:24,273][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:09:24,991][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:09:25,710][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:09:26,426][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:09:27,145][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:09:27,863][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:09:28,580][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:09:29,298][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:09:30,016][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:09:30,734][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:09:31,451][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:09:32,171][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:09:32,889][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:09:33,607][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:09:34,326][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:09:35,045][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:09:35,764][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:09:36,752][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:09:41,655][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:09:42,608][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:09:43,327][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:09:44,045][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:09:44,764][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:09:45,484][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:09:46,203][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:09:46,921][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:09:47,640][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:09:48,357][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:09:49,075][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:09:49,791][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:09:50,510][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:09:51,226][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:09:51,946][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:09:52,666][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:09:53,478][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:51 [2026-03-25 18:09:54,550][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:09:54,554][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:09:54,555][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:09:56,153][__main__][INFO] - Iteration 235 took 1m 0s (8.69% Gen, 88.67% Train). Generation: 5s, Training: 53s. Estimated remaining time: 13h 5m 10s. Estimated total time: 16h 50m 32s. Time estimates for 10 more iterations: 10m 6s, 100 more iterations: 1h 41m 3s, 500 more iterations: 8h 25m 16s. [2026-03-25 18:09:56,157][__main__][INFO] - Starting iteration 235. [2026-03-25 18:09:56,164][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 18:09:56,164][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:10:01,742][__main__][INFO] - Number of regex retries in iteration 235: 0 [2026-03-25 18:10:01,744][__main__][INFO] - agents played in iteration 235 are Bob, Alice [2026-03-25 18:10:02,329][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:10:02,398][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:10:02,399][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:10:02,399][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:10:03,130][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:10:03,776][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:10:04,493][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:10:05,211][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:10:05,927][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:10:06,647][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:10:07,364][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:10:08,083][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:10:08,801][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:10:09,519][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:10:10,238][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:10:10,957][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:10:11,677][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:10:12,396][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:10:13,114][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:10:13,833][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:10:14,552][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:10:15,272][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:10:15,992][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:10:16,713][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:10:17,433][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:10:18,154][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:10:18,874][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:10:19,598][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:10:20,321][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:10:21,042][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:10:21,763][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:10:22,485][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:10:23,206][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:10:23,925][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:10:24,644][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:10:25,362][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:10:26,080][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:10:26,798][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:10:27,517][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:10:28,236][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:10:28,956][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:10:29,682][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:10:30,403][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:10:31,125][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:10:31,844][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:10:32,566][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:10:33,286][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:10:34,006][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:10:34,726][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:10:35,448][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:10:36,167][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:10:36,888][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:10:37,888][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:10:38,608][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:10:39,327][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:10:40,046][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:10:40,767][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:10:41,487][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:10:42,207][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:10:42,930][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:10:43,650][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:10:44,369][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:10:45,092][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:10:45,814][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:10:46,534][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:10:47,256][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:10:47,978][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:10:48,701][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:10:49,422][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:10:50,170][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 18:10:51,286][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:10:51,290][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:10:51,292][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:10:52,715][__main__][INFO] - Iteration 236 took 56s (9.87% Gen, 87.61% Train). Generation: 5s, Training: 49s. Estimated remaining time: 11h 56m 15s. Estimated total time: 15h 42m 33s. Time estimates for 10 more iterations: 9m 25s, 100 more iterations: 1h 34m 15s, 500 more iterations: 7h 51m 16s. [2026-03-25 18:10:52,719][__main__][INFO] - Starting iteration 236. [2026-03-25 18:10:52,724][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 18:10:52,725][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:10:57,984][__main__][INFO] - Number of regex retries in iteration 236: 0 [2026-03-25 18:10:57,986][__main__][INFO] - agents played in iteration 236 are Bob, Alice [2026-03-25 18:10:58,507][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:10:58,575][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:10:58,576][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:10:58,577][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:10:59,278][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:10:59,926][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:11:00,645][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:11:01,363][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:11:02,080][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:11:02,798][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:11:03,515][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:11:04,232][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:11:04,952][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:11:05,668][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:11:06,388][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:11:07,106][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:11:07,824][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:11:08,543][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:11:09,262][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:11:09,978][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:11:10,699][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:11:11,416][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:11:12,135][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:11:12,853][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:11:13,571][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:11:14,292][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:11:15,009][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:11:15,729][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:11:16,449][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:11:17,167][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:11:17,886][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:11:18,604][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:11:19,322][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:11:20,041][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:11:20,759][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:11:21,479][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:11:22,198][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:11:22,916][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:11:23,639][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:11:24,356][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:11:25,076][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:11:25,796][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:11:26,515][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:11:27,236][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:11:27,955][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:11:28,674][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:11:29,396][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:11:30,116][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:11:30,836][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:11:31,557][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:11:32,276][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:11:32,997][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:11:33,954][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:11:34,675][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:11:35,394][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:11:36,114][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:11:36,835][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:11:37,555][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:11:38,274][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:11:38,995][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:11:39,716][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:11:40,436][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:11:41,156][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:11:41,876][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:11:42,596][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:11:43,318][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:11:44,038][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:11:44,758][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:11:45,478][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:11:46,222][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 18:11:47,597][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:11:47,601][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:11:47,602][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:11:49,031][__main__][INFO] - Iteration 237 took 56s (9.34% Gen, 88.11% Train). Generation: 5s, Training: 49s. Estimated remaining time: 11h 51m 14s. Estimated total time: 15h 38m 29s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 50s, 500 more iterations: 7h 49m 14s. [2026-03-25 18:11:49,036][__main__][INFO] - Starting iteration 237. [2026-03-25 18:11:49,040][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 18:11:49,041][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:11:54,185][__main__][INFO] - Number of regex retries in iteration 237: 0 [2026-03-25 18:11:54,187][__main__][INFO] - agents played in iteration 237 are Bob, Alice [2026-03-25 18:11:54,681][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:11:54,746][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:11:54,747][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:11:54,748][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:11:55,449][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:11:56,096][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:11:56,817][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:11:57,536][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:11:58,253][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:11:58,971][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:11:59,690][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:12:00,406][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:12:01,127][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:12:01,845][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:12:02,565][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:12:03,283][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:12:04,002][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:12:04,721][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:12:05,440][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:12:06,159][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:12:06,880][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:12:07,597][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:12:08,317][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:12:09,035][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:12:09,755][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:12:10,475][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:12:11,194][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:12:11,914][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:12:12,634][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:12:13,355][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:12:14,074][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:12:14,795][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:12:15,515][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:12:16,235][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:12:16,954][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:12:17,674][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:12:18,394][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:12:19,113][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:12:19,834][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:12:20,554][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:12:21,274][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:12:21,997][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:12:22,717][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:12:23,437][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:12:24,158][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:12:24,879][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:12:25,600][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:12:26,321][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:12:27,043][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:12:27,763][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:12:28,483][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:12:29,205][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:12:30,178][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:12:30,901][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:12:31,622][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:12:32,343][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:12:33,064][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:12:33,787][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:12:34,507][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:12:35,229][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:12:35,949][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:12:36,672][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:12:37,391][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:12:38,111][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:12:38,836][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:12:39,558][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:12:40,277][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:12:40,999][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:12:41,723][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:12:42,511][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 18:12:43,711][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:12:43,715][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:12:43,717][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:12:45,373][__main__][INFO] - Iteration 238 took 56s (9.13% Gen, 87.92% Train). Generation: 5s, Training: 49s. Estimated remaining time: 11h 50m 44s. Estimated total time: 15h 38m 55s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 53s, 500 more iterations: 7h 49m 27s. [2026-03-25 18:12:45,375][__main__][INFO] - Starting iteration 238. [2026-03-25 18:12:45,380][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 18:12:45,381][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:12:56,178][__main__][INFO] - Number of regex retries in iteration 238: 0 [2026-03-25 18:12:56,180][__main__][INFO] - agents played in iteration 238 are Bob, Alice [2026-03-25 18:12:56,678][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:12:56,745][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:12:56,746][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:12:56,747][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:12:57,445][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:12:58,092][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:12:58,815][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:12:59,531][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:13:00,248][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:13:00,965][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:13:01,682][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:13:02,402][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:13:03,119][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:13:03,839][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:13:04,557][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:13:05,275][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:13:05,995][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:13:06,711][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:13:07,430][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:13:08,150][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:13:08,869][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:13:09,588][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:13:10,307][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:13:11,025][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:13:11,745][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:13:12,462][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:13:13,181][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:13:13,900][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:13:14,619][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:13:15,338][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:13:16,056][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:13:16,775][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:13:17,495][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:13:18,213][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:13:18,932][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:13:19,653][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:13:20,372][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:13:21,092][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:13:21,815][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:13:22,534][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:13:23,254][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:13:23,974][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:13:24,695][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:13:25,415][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:13:26,135][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:13:26,857][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:13:27,577][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:13:28,297][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:13:29,019][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:13:29,739][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:13:30,459][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:13:31,181][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:13:32,176][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:13:32,897][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:13:33,618][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:13:34,338][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:13:35,059][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:13:35,781][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:13:36,504][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:13:37,224][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:13:37,947][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:13:38,669][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:13:39,389][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:13:40,111][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:13:40,834][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:13:41,555][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:13:42,276][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:13:42,998][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:13:43,722][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:13:44,449][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 18:13:45,589][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:13:45,593][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:13:45,595][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:13:47,172][__main__][INFO] - Iteration 239 took 1m 1s (17.48% Gen, 79.97% Train). Generation: 10s, Training: 49s. Estimated remaining time: 13h 20m 41s. Estimated total time: 17h 9m 53s. Time estimates for 10 more iterations: 10m 17s, 100 more iterations: 1h 42m 59s, 500 more iterations: 8h 34m 56s. [2026-03-25 18:13:47,175][__main__][INFO] - Starting iteration 239. [2026-03-25 18:13:47,179][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 18:13:47,180][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:13:52,499][__main__][INFO] - Number of regex retries in iteration 239: 0 [2026-03-25 18:13:52,500][__main__][INFO] - agents played in iteration 239 are Bob, Alice [2026-03-25 18:13:52,997][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:13:53,063][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:13:53,065][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:13:53,066][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:13:53,772][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:13:54,421][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:13:55,142][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:13:55,861][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:13:56,580][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:13:57,300][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:13:58,017][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:13:58,737][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:13:59,459][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:14:00,178][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:14:00,897][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:14:01,618][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:14:02,338][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:14:03,058][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:14:03,779][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:14:04,500][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:14:05,220][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:14:05,939][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:14:06,659][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:14:07,380][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:14:08,100][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:14:08,822][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:14:09,542][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:14:10,265][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:14:10,985][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:14:11,706][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:14:12,428][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:14:13,150][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:14:13,870][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:14:14,591][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:14:15,312][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:14:16,035][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:14:16,756][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:14:17,477][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:14:18,200][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:14:18,923][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:14:19,646][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:14:20,368][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:14:21,088][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:14:21,810][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:14:22,533][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:14:23,254][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:14:23,976][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:14:24,698][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:14:25,419][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:14:26,142][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:14:26,865][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:14:27,586][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:14:28,548][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:14:29,273][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:14:29,996][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:14:30,719][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:14:31,443][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:14:32,164][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:14:32,888][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:14:33,612][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:14:34,335][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:14:35,059][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:14:35,782][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:14:36,505][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:14:37,229][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:14:37,951][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:14:38,676][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:14:39,399][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:14:40,122][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:14:40,864][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 18:14:41,919][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:14:41,922][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:14:41,924][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:14:44,475][__main__][INFO] - Iteration 240 took 57s (9.29% Gen, 86.26% Train). Generation: 5s, Training: 49s. Estimated remaining time: 12h 4m 47s. Estimated total time: 15h 54m 57s. Time estimates for 10 more iterations: 9m 32s, 100 more iterations: 1h 35m 29s, 500 more iterations: 7h 57m 28s. [2026-03-25 18:14:44,479][__main__][INFO] - Starting iteration 240. [2026-03-25 18:14:44,485][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 18:14:44,486][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:14:49,600][__main__][INFO] - Number of regex retries in iteration 240: 0 [2026-03-25 18:14:49,601][__main__][INFO] - agents played in iteration 240 are Bob, Alice [2026-03-25 18:14:50,102][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:14:50,170][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:14:50,171][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:14:50,172][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:14:50,869][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:14:51,519][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:14:52,242][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:14:52,962][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:14:53,683][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:14:54,404][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:14:55,125][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:14:55,844][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:14:56,566][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:14:57,287][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:14:58,008][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:14:58,728][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:14:59,450][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:15:00,172][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:15:00,892][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:15:01,613][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:15:02,336][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:15:03,056][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:15:03,776][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:15:04,498][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:15:05,219][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:15:05,944][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:15:06,665][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:15:07,385][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:15:08,108][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:15:08,830][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:15:09,555][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:15:10,276][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:15:10,996][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:15:11,719][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:15:12,442][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:15:13,164][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:15:13,889][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:15:14,609][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:15:15,331][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:15:16,054][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:15:16,777][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:15:17,499][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:15:18,220][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:15:18,942][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:15:19,665][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:15:20,389][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:15:21,112][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:15:21,836][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:15:22,558][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:15:23,281][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:15:24,004][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:15:24,728][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:15:25,695][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:15:26,420][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:15:27,143][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:15:27,869][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:15:28,595][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:15:29,319][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:15:30,044][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:15:30,768][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:15:31,491][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:15:32,214][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:15:32,937][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:15:33,662][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:15:34,386][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:15:35,113][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:15:35,837][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:15:36,563][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:15:37,289][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:15:38,082][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 18:15:39,400][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:15:39,405][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:15:39,407][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:15:41,416][__main__][INFO] - Iteration 241 took 56s (8.99% Gen, 87.48% Train). Generation: 5s, Training: 49s. Estimated remaining time: 11h 57m 46s. Estimated total time: 15h 48m 53s. Time estimates for 10 more iterations: 9m 29s, 100 more iterations: 1h 34m 53s, 500 more iterations: 7h 54m 26s. [2026-03-25 18:15:41,419][__main__][INFO] - Starting iteration 241. [2026-03-25 18:15:41,423][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 18:15:41,426][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:15:46,735][__main__][INFO] - Number of regex retries in iteration 241: 0 [2026-03-25 18:15:46,737][__main__][INFO] - agents played in iteration 241 are Bob, Alice [2026-03-25 18:15:47,227][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:15:47,293][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:15:47,294][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:15:47,295][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:15:47,992][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:15:48,642][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:15:49,364][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:15:50,086][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:15:50,805][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:15:51,527][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:15:52,247][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:15:52,969][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:15:53,689][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:15:54,413][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:15:55,131][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:15:55,850][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:15:56,574][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:15:57,295][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:15:58,019][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:15:58,743][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:15:59,464][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:16:00,187][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:16:00,907][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:16:01,629][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:16:02,352][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:16:03,075][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:16:03,798][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:16:04,518][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:16:05,241][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:16:05,964][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:16:06,686][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:16:07,410][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:16:08,131][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:16:08,855][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:16:09,580][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:16:10,302][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:16:11,024][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:16:11,745][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:16:12,468][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:16:13,191][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:16:13,916][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:16:14,640][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:16:15,361][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:16:16,084][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:16:16,807][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:16:17,530][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:16:18,254][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:16:18,978][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:16:19,703][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:16:20,427][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:16:21,151][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:16:21,874][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:16:22,875][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:16:23,600][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:16:24,323][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:16:25,047][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:16:25,770][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:16:26,493][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:16:27,217][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:16:27,942][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:16:28,667][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:16:29,391][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:16:30,116][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:16:30,840][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:16:31,564][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:16:32,288][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:16:33,012][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:16:33,735][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:16:34,459][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:16:35,197][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 18:16:36,542][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:16:36,549][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:16:36,551][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:16:38,025][__main__][INFO] - Iteration 242 took 56s (9.38% Gen, 88.01% Train). Generation: 5s, Training: 49s. Estimated remaining time: 11h 51m 19s. Estimated total time: 15h 43m 23s. Time estimates for 10 more iterations: 9m 26s, 100 more iterations: 1h 34m 20s, 500 more iterations: 7h 51m 41s. [2026-03-25 18:16:38,028][__main__][INFO] - Starting iteration 242. [2026-03-25 18:16:38,033][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 18:16:38,034][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:16:43,168][__main__][INFO] - Number of regex retries in iteration 242: 0 [2026-03-25 18:16:43,169][__main__][INFO] - agents played in iteration 242 are Bob, Alice [2026-03-25 18:16:43,728][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:16:43,794][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:16:43,794][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:16:43,795][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:16:44,485][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:16:45,134][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:16:45,858][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:16:46,579][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:16:47,299][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:16:48,019][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:16:48,741][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:16:49,462][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:16:50,183][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:16:50,905][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:16:51,627][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:16:52,347][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:16:53,070][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:16:53,792][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:16:54,514][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:16:55,235][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:16:55,956][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:16:56,679][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:16:57,401][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:16:58,123][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:16:58,845][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:16:59,567][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:17:00,290][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:17:01,011][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:17:01,734][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:17:02,457][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:17:03,182][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:17:03,908][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:17:04,631][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:17:05,353][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:17:06,078][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:17:06,801][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:17:07,523][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:17:08,247][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:17:08,970][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:17:09,694][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:17:10,417][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:17:11,139][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:17:11,862][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:17:12,588][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:17:13,314][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:17:14,038][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:17:14,762][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:17:15,486][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:17:16,208][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:17:16,932][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:17:17,656][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:17:18,380][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:17:19,334][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:17:20,058][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:17:20,782][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:17:21,506][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:17:22,232][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:17:22,960][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:17:23,683][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:17:24,408][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:17:25,132][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:17:25,855][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:17:26,580][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:17:27,303][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:17:28,028][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:17:28,753][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:17:29,479][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:17:30,204][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:17:30,930][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:17:31,693][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 18:17:32,907][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:17:32,912][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:17:32,914][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:17:34,584][__main__][INFO] - Iteration 243 took 56s (9.08% Gen, 87.96% Train). Generation: 5s, Training: 49s. Estimated remaining time: 11h 49m 33s. Estimated total time: 15h 42m 33s. Time estimates for 10 more iterations: 9m 25s, 100 more iterations: 1h 34m 15s, 500 more iterations: 7h 51m 16s. [2026-03-25 18:17:34,587][__main__][INFO] - Starting iteration 243. [2026-03-25 18:17:34,591][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 18:17:34,592][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:17:39,935][__main__][INFO] - Number of regex retries in iteration 243: 0 [2026-03-25 18:17:39,936][__main__][INFO] - agents played in iteration 243 are Bob, Alice [2026-03-25 18:17:40,468][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:17:40,535][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:17:40,536][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:17:40,537][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:17:41,233][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:17:41,883][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:17:42,606][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:17:43,329][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:17:44,052][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:17:44,775][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:17:45,499][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:17:46,223][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:17:46,944][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:17:47,666][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:17:48,389][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:17:49,111][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:17:49,833][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:17:50,557][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:17:51,279][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:17:52,000][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:17:52,725][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:17:53,448][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:17:54,171][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:17:54,894][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:17:55,616][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:17:56,338][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:17:57,061][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:17:57,783][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:17:58,507][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:17:59,228][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:17:59,952][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:18:00,676][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:18:01,400][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:18:02,122][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:18:02,847][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:18:03,570][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:18:04,294][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:18:05,019][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:18:05,743][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:18:06,465][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:18:07,189][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:18:07,914][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:18:08,640][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:18:09,365][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:18:10,092][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:18:10,815][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:18:11,536][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:18:12,262][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:18:12,986][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:18:13,712][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:18:14,437][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:18:15,160][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:18:16,132][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:18:16,858][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:18:17,580][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:18:18,306][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:18:19,030][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:18:19,755][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:18:20,478][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:18:21,203][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:18:21,926][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:18:22,652][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:18:23,379][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:18:24,103][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:18:24,830][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:18:25,555][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:18:26,282][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:18:27,007][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:18:27,733][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:18:28,521][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 18:18:29,671][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:18:29,675][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:18:29,677][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:18:31,044][__main__][INFO] - Iteration 244 took 56s (9.47% Gen, 88.11% Train). Generation: 5s, Training: 49s. Estimated remaining time: 11h 46m 57s. Estimated total time: 15h 40m 54s. Time estimates for 10 more iterations: 9m 24s, 100 more iterations: 1h 34m 5s, 500 more iterations: 7h 50m 27s. [2026-03-25 18:18:31,048][__main__][INFO] - Starting iteration 244. [2026-03-25 18:18:31,052][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 18:18:31,053][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:18:36,289][__main__][INFO] - Number of regex retries in iteration 244: 0 [2026-03-25 18:18:36,290][__main__][INFO] - agents played in iteration 244 are Bob, Alice [2026-03-25 18:18:36,824][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:18:36,891][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:18:36,892][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:18:36,893][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:18:37,599][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:18:38,250][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:18:38,978][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:18:39,701][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:18:40,424][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:18:41,147][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:18:41,869][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:18:42,592][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:18:43,316][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:18:44,038][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:18:44,763][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:18:45,487][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:18:46,210][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:18:46,934][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:18:47,656][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:18:48,381][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:18:49,105][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:18:49,830][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:18:50,555][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:18:51,279][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:18:52,004][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:18:52,730][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:18:53,456][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:18:54,180][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:18:54,905][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:18:55,630][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:18:56,355][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:18:57,078][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:18:57,801][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:18:58,527][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:18:59,252][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:18:59,977][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:19:00,704][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:19:01,430][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:19:02,156][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:19:02,882][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:19:03,606][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:19:04,332][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:19:05,057][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:19:05,783][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:19:06,507][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:19:07,233][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:19:07,960][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:19:08,685][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:19:09,412][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:19:10,137][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:19:10,864][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:19:11,591][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:19:12,593][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:19:13,318][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:19:14,042][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:19:14,768][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:19:15,493][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:19:16,219][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:19:16,945][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:19:17,671][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:19:18,397][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:19:19,123][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:19:19,852][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:19:20,578][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:19:21,304][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:19:22,029][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:19:22,755][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:19:23,482][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:19:24,207][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:19:24,947][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 18:19:26,210][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:19:26,215][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:19:26,217][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:19:27,630][__main__][INFO] - Iteration 245 took 56s (9.26% Gen, 88.24% Train). Generation: 5s, Training: 49s. Estimated remaining time: 11h 48m 6s. Estimated total time: 15h 43m 0s. Time estimates for 10 more iterations: 9m 25s, 100 more iterations: 1h 34m 18s, 500 more iterations: 7h 51m 30s. [2026-03-25 18:19:27,633][__main__][INFO] - Starting iteration 245. [2026-03-25 18:19:27,637][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 18:19:27,638][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:19:32,868][__main__][INFO] - Number of regex retries in iteration 245: 0 [2026-03-25 18:19:32,869][__main__][INFO] - agents played in iteration 245 are Bob, Alice [2026-03-25 18:19:33,357][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:19:33,423][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:19:33,424][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:19:33,425][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:19:34,124][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:19:34,776][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:19:35,500][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:19:36,221][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:19:36,944][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:19:37,668][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:19:38,392][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:19:39,117][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:19:39,842][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:19:40,564][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:19:41,287][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:19:42,011][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:19:42,734][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:19:43,462][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:19:44,185][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:19:44,910][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:19:45,634][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:19:46,360][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:19:47,086][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:19:47,811][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:19:48,535][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:19:49,258][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:19:49,983][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:19:50,707][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:19:51,430][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:19:52,157][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:19:52,882][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:19:53,609][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:19:54,333][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:19:55,059][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:19:55,785][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:19:56,510][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:19:57,238][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:19:57,963][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:19:58,689][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:19:59,415][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:20:00,141][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:20:00,868][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:20:01,594][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:20:02,320][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:20:03,047][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:20:03,773][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:20:04,499][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:20:05,227][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:20:05,953][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:20:06,679][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:20:07,406][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:20:08,132][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:20:09,101][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:20:09,829][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:20:10,554][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:20:11,280][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:20:12,007][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:20:12,733][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:20:13,460][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:20:14,187][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:20:14,914][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:20:15,641][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:20:16,369][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:20:17,097][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:20:17,824][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:20:18,552][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:20:19,278][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:20:20,005][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:20:20,732][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:20:21,478][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 18:20:22,532][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:20:22,536][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:20:22,537][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:20:24,215][__main__][INFO] - Iteration 246 took 56s (9.25% Gen, 87.78% Train). Generation: 5s, Training: 49s. Estimated remaining time: 11h 47m 9s. Estimated total time: 15h 42m 59s. Time estimates for 10 more iterations: 9m 25s, 100 more iterations: 1h 34m 17s, 500 more iterations: 7h 51m 29s. [2026-03-25 18:20:24,218][__main__][INFO] - Starting iteration 246. [2026-03-25 18:20:24,223][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 18:20:24,223][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:20:29,448][__main__][INFO] - Number of regex retries in iteration 246: 0 [2026-03-25 18:20:29,449][__main__][INFO] - agents played in iteration 246 are Bob, Alice [2026-03-25 18:20:29,945][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:20:30,012][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:20:30,013][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:20:30,013][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:20:30,722][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:20:31,374][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:20:32,100][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:20:32,823][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:20:33,546][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:20:34,270][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:20:34,993][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:20:35,717][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:20:36,440][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:20:37,164][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:20:37,888][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:20:38,611][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:20:39,337][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:20:40,062][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:20:40,787][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:20:41,512][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:20:42,236][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:20:42,958][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:20:43,684][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:20:44,409][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:20:45,134][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:20:45,858][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:20:46,584][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:20:47,309][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:20:48,034][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:20:48,759][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:20:49,485][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:20:50,212][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:20:50,939][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:20:51,665][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:20:52,391][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:20:53,117][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:20:53,844][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:20:54,569][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:20:55,297][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:20:56,023][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:20:56,749][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:20:57,476][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:20:58,203][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:20:58,929][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:20:59,655][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:21:00,381][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:21:01,106][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:21:01,833][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:21:02,561][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:21:03,287][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:21:04,014][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:21:04,739][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:21:05,718][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:21:06,446][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:21:07,171][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:21:07,898][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:21:08,627][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:21:09,355][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:21:10,082][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:21:10,809][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:21:11,535][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:21:12,262][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:21:12,990][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:21:13,716][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:21:14,443][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:21:15,169][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:21:15,897][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:21:16,622][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:21:17,350][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:21:18,140][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 18:21:19,526][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:21:19,575][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:21:19,577][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:21:22,180][__main__][INFO] - Iteration 247 took 57s (9.02% Gen, 86.48% Train). Generation: 5s, Training: 50s. Estimated remaining time: 12h 9m 12s. Estimated total time: 16h 6m 0s. Time estimates for 10 more iterations: 9m 39s, 100 more iterations: 1h 36m 36s, 500 more iterations: 8h 3m 0s. [2026-03-25 18:21:22,184][__main__][INFO] - Starting iteration 247. [2026-03-25 18:21:22,189][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 18:21:22,190][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:21:28,242][__main__][INFO] - Number of regex retries in iteration 247: 0 [2026-03-25 18:21:28,243][__main__][INFO] - agents played in iteration 247 are Bob, Alice [2026-03-25 18:21:28,745][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:21:28,812][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:21:28,813][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:21:28,814][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:21:29,514][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:21:30,167][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:21:30,889][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:21:31,612][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:21:32,334][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:21:33,055][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:21:33,779][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:21:34,503][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:21:35,225][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:21:35,947][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:21:36,670][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:21:37,395][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:21:38,119][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:21:38,843][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:21:39,566][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:21:40,289][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:21:41,013][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:21:41,737][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:21:42,462][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:21:43,184][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:21:43,907][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:21:44,630][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:21:45,355][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:21:46,080][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:21:46,805][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:21:47,529][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:21:48,254][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:21:48,979][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:21:49,705][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:21:50,429][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:21:51,152][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:21:51,877][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:21:52,602][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:21:53,327][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:21:54,050][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:21:54,778][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:21:55,505][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:21:56,230][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:21:56,959][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:21:57,685][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:21:58,415][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:21:59,141][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:21:59,865][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:22:00,593][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:22:01,318][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:22:02,044][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:22:02,770][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:22:03,497][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:22:04,483][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:22:05,210][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:22:05,938][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:22:06,662][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:22:07,389][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:22:08,116][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:22:08,842][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:22:09,570][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:22:10,295][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:22:11,022][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:22:11,748][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:22:12,473][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:22:13,199][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:22:13,926][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:22:14,652][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:22:15,378][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:22:16,105][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:22:16,835][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 18:22:18,118][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:22:18,123][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:22:18,125][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:22:19,624][__main__][INFO] - Iteration 248 took 57s (10.54% Gen, 86.85% Train). Generation: 6s, Training: 49s. Estimated remaining time: 11h 59m 32s. Estimated total time: 15h 57m 18s. Time estimates for 10 more iterations: 9m 34s, 100 more iterations: 1h 35m 43s, 500 more iterations: 7h 58m 39s. [2026-03-25 18:22:19,628][__main__][INFO] - Starting iteration 248. [2026-03-25 18:22:19,634][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 18:22:19,635][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:22:25,150][__main__][INFO] - Number of regex retries in iteration 248: 0 [2026-03-25 18:22:25,151][__main__][INFO] - agents played in iteration 248 are Bob, Alice [2026-03-25 18:22:25,639][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:22:25,704][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:22:25,705][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:22:25,706][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:22:26,387][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:22:27,039][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:22:27,764][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:22:28,488][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:22:29,209][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:22:29,931][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:22:30,656][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:22:31,379][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:22:32,102][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:22:32,824][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:22:33,547][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:22:34,271][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:22:34,995][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:22:35,718][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:22:36,442][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:22:37,165][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:22:37,890][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:22:38,615][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:22:39,340][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:22:40,065][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:22:40,788][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:22:41,511][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:22:42,236][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:22:42,963][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:22:43,687][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:22:44,414][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:22:45,138][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:22:45,861][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:22:46,587][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:22:47,312][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:22:48,039][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:22:48,764][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:22:49,491][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:22:50,217][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:22:50,944][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:22:51,670][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:22:52,397][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:22:53,122][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:22:53,847][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:22:54,574][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:22:55,299][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:22:56,025][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:22:56,751][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:22:57,476][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:22:58,202][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:22:58,928][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:22:59,654][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:23:00,379][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:23:01,339][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:23:02,066][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:23:02,790][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:23:03,516][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:23:04,243][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:23:04,970][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:23:05,696][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:23:06,424][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:23:07,148][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:23:07,876][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:23:08,603][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:23:09,330][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:23:10,056][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:23:10,783][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:23:11,509][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:23:12,234][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:23:12,962][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:23:13,692][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 18:23:14,899][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:23:14,904][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:23:14,906][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:23:16,327][__main__][INFO] - Iteration 249 took 56s (9.73% Gen, 87.76% Train). Generation: 5s, Training: 49s. Estimated remaining time: 11h 46m 13s. Estimated total time: 15h 44m 55s. Time estimates for 10 more iterations: 9m 26s, 100 more iterations: 1h 34m 29s, 500 more iterations: 7h 52m 27s. [2026-03-25 18:23:16,330][__main__][INFO] - Starting iteration 249. [2026-03-25 18:23:16,334][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 18:23:16,335][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:23:21,841][__main__][INFO] - Number of regex retries in iteration 249: 0 [2026-03-25 18:23:21,842][__main__][INFO] - agents played in iteration 249 are Bob, Alice [2026-03-25 18:23:22,410][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:23:22,475][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:23:22,476][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:23:22,477][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:23:23,171][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:23:23,823][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:23:24,549][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:23:25,272][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:23:25,994][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:23:26,718][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:23:27,442][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:23:28,166][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:23:28,890][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:23:29,613][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:23:30,335][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:23:31,059][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:23:31,783][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:23:32,508][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:23:33,232][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:23:33,957][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:23:34,681][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:23:35,405][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:23:36,129][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:23:36,853][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:23:37,578][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:23:38,304][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:23:39,031][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:23:39,756][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:23:40,482][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:23:41,208][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:23:41,933][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:23:42,660][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:23:43,386][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:23:44,113][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:23:44,838][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:23:45,564][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:23:46,289][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:23:47,014][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:23:47,738][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:23:48,465][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:23:49,190][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:23:49,916][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:23:50,643][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:23:51,368][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:23:52,094][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:23:52,819][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:23:53,545][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:23:54,271][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:23:54,997][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:23:55,723][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:23:56,449][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:23:57,176][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:23:58,150][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:23:58,875][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:23:59,600][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:24:00,327][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:24:01,052][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:24:01,778][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:24:02,506][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:24:03,233][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:24:03,958][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:24:04,685][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:24:05,411][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:24:06,137][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:24:06,866][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:24:08,071][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:24:08,800][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:24:09,526][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:24:10,252][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:24:11,044][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 18:24:12,296][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:24:12,300][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:24:12,302][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:24:13,912][__main__][INFO] - Iteration 250 took 57s (9.56% Gen, 87.63% Train). Generation: 5s, Training: 50s. Estimated remaining time: 12h 0m 0s. Estimated total time: 15h 59m 39s. Time estimates for 10 more iterations: 9m 35s, 100 more iterations: 1h 35m 57s, 500 more iterations: 7h 59m 49s. [2026-03-25 18:24:13,915][__main__][INFO] - Starting iteration 250. [2026-03-25 18:24:13,920][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2026-03-25 18:24:13,921][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:24:19,182][__main__][INFO] - Number of regex retries in iteration 250: 0 [2026-03-25 18:24:19,183][__main__][INFO] - agents played in iteration 250 are Bob, Alice [2026-03-25 18:24:19,714][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:24:19,781][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:24:19,782][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:24:19,783][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:24:20,482][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:24:21,135][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:24:21,861][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:24:22,584][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:24:23,307][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:24:24,030][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:24:24,753][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:24:25,475][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:24:26,197][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:24:26,922][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:24:27,645][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:24:28,370][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:24:29,095][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:24:29,818][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:24:30,541][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:24:31,264][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:24:31,987][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:24:32,713][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:24:33,437][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:24:34,161][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:24:34,887][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:24:35,612][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:24:36,336][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:24:37,061][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:24:37,784][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:24:38,509][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:24:39,235][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:24:39,959][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:24:40,684][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:24:41,409][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:24:42,134][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:24:42,860][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:24:43,586][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:24:44,311][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:24:45,036][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:24:45,761][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:24:46,484][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:24:47,210][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:24:47,936][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:24:48,662][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:24:49,388][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:24:50,113][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:24:50,840][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:24:51,567][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:24:52,294][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:24:53,021][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:24:53,748][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:24:54,475][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:24:55,461][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:24:56,187][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:24:56,912][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:24:57,637][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:24:58,364][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:24:59,092][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:24:59,818][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:25:00,544][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:25:01,272][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:25:01,997][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:25:02,725][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:25:03,451][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:25:04,179][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:25:04,906][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:25:05,632][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:25:06,358][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:25:07,084][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:25:07,818][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 18:25:08,889][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:25:08,892][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:25:08,894][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:25:11,610][__main__][INFO] - Iteration 251 took 57s (9.12% Gen, 86.16% Train). Generation: 5s, Training: 49s. Estimated remaining time: 12h 0m 54s. Estimated total time: 16h 1m 32s. Time estimates for 10 more iterations: 9m 36s, 100 more iterations: 1h 36m 9s, 500 more iterations: 8h 0m 46s. [2026-03-25 18:25:11,613][__main__][INFO] - Starting iteration 251. [2026-03-25 18:25:11,617][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:25:11,618][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:25:16,771][__main__][INFO] - Number of regex retries in iteration 251: 0 [2026-03-25 18:25:16,772][__main__][INFO] - agents played in iteration 251 are Bob, Alice [2026-03-25 18:25:17,261][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:25:17,327][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:25:17,328][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:25:17,329][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:25:18,021][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:25:18,673][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:25:19,397][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:25:20,120][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:25:20,843][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:25:21,567][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:25:22,289][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:25:23,012][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:25:23,735][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:25:24,457][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:25:25,182][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:25:25,905][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:25:26,630][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:25:27,353][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:25:28,075][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:25:28,799][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:25:29,523][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:25:30,247][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:25:30,972][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:25:31,696][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:25:32,422][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:25:33,148][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:25:33,873][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:25:34,596][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:25:35,321][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:25:36,046][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:25:36,772][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:25:37,497][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:25:38,224][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:25:38,950][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:25:39,675][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:25:40,401][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:25:41,128][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:25:41,854][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:25:42,582][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:25:43,309][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:25:44,035][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:25:44,763][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:25:45,489][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:25:46,216][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:25:46,943][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:25:47,670][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:25:48,396][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:25:49,122][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:25:49,848][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:25:50,575][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:25:51,303][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:25:52,029][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:25:52,993][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:25:53,721][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:25:54,448][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:25:55,175][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:25:55,902][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:25:56,629][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:25:57,357][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:25:58,084][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:25:58,811][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:25:59,539][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:26:00,266][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:26:00,993][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:26:01,722][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:26:02,449][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:26:03,177][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:26:03,906][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:26:04,633][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:26:05,384][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 18:26:06,790][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:26:06,796][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:26:06,798][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:26:08,175][__main__][INFO] - Iteration 252 took 56s (9.11% Gen, 88.45% Train). Generation: 5s, Training: 50s. Estimated remaining time: 11h 41m 5s. Estimated total time: 15h 42m 39s. Time estimates for 10 more iterations: 9m 25s, 100 more iterations: 1h 34m 15s, 500 more iterations: 7h 51m 19s. [2026-03-25 18:26:08,178][__main__][INFO] - Starting iteration 252. [2026-03-25 18:26:08,182][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:26:08,183][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:26:15,320][__main__][INFO] - Number of regex retries in iteration 252: 0 [2026-03-25 18:26:15,321][__main__][INFO] - agents played in iteration 252 are Bob, Alice [2026-03-25 18:26:15,824][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:26:15,892][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:26:15,893][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:26:15,893][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:26:16,598][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:26:17,251][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:26:17,976][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:26:18,701][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:26:19,424][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:26:20,149][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:26:20,871][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:26:21,594][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:26:22,317][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:26:23,041][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:26:23,766][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:26:24,491][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:26:25,216][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:26:25,939][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:26:26,662][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:26:27,386][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:26:28,110][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:26:28,833][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:26:29,557][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:26:30,282][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:26:31,007][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:26:31,732][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:26:32,458][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:26:33,183][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:26:33,910][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:26:34,634][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:26:35,357][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:26:36,083][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:26:36,807][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:26:37,535][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:26:38,260][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:26:38,986][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:26:39,712][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:26:40,437][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:26:41,163][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:26:41,887][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:26:42,614][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:26:43,341][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:26:44,067][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:26:44,793][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:26:45,520][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:26:46,249][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:26:46,975][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:26:47,700][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:26:48,425][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:26:49,150][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:26:49,876][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:26:50,601][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:26:51,575][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:26:52,303][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:26:53,030][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:26:53,759][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:26:54,485][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:26:55,213][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:26:55,940][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:26:56,668][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:26:57,394][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:26:58,121][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:26:58,850][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:26:59,577][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:27:00,304][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:27:01,031][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:27:01,759][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:27:02,486][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:27:03,213][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:27:04,002][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 18:27:05,310][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:27:05,314][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:27:05,316][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:27:06,882][__main__][INFO] - Iteration 253 took 58s (12.16% Gen, 85.17% Train). Generation: 7s, Training: 49s. Estimated remaining time: 12h 15m 50s. Estimated total time: 16h 18m 22s. Time estimates for 10 more iterations: 9m 47s, 100 more iterations: 1h 37m 50s, 500 more iterations: 8h 9m 11s. [2026-03-25 18:27:06,886][__main__][INFO] - Starting iteration 253. [2026-03-25 18:27:06,892][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:27:06,893][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:27:12,077][__main__][INFO] - Number of regex retries in iteration 253: 0 [2026-03-25 18:27:12,078][__main__][INFO] - agents played in iteration 253 are Bob, Alice [2026-03-25 18:27:12,574][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:27:12,641][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:27:12,642][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:27:12,642][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:27:13,347][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:27:13,998][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:27:14,724][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:27:15,446][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:27:16,171][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:27:16,894][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:27:17,618][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:27:18,343][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:27:19,066][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:27:19,789][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:27:20,511][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:27:21,235][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:27:21,960][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:27:22,684][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:27:23,409][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:27:24,133][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:27:24,858][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:27:25,583][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:27:26,308][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:27:27,032][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:27:27,756][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:27:28,481][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:27:29,207][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:27:29,932][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:27:30,657][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:27:31,383][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:27:32,109][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:27:32,835][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:27:33,561][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:27:34,287][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:27:35,013][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:27:35,739][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:27:36,464][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:27:37,191][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:27:37,917][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:27:38,643][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:27:39,370][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:27:40,097][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:27:40,823][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:27:41,549][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:27:42,277][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:27:43,005][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:27:43,731][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:27:44,458][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:27:45,183][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:27:45,910][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:27:46,637][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:27:47,364][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:27:48,365][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:27:49,094][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:27:49,821][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:27:50,548][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:27:51,275][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:27:52,003][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:27:52,732][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:27:53,458][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:27:54,186][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:27:54,913][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:27:55,640][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:27:56,366][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:27:57,092][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:27:57,820][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:27:58,547][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:27:59,274][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:28:00,001][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:28:00,769][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 18:28:01,840][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:28:01,844][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:28:01,845][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:28:03,283][__main__][INFO] - Iteration 254 took 56s (9.19% Gen, 88.25% Train). Generation: 5s, Training: 49s. Estimated remaining time: 11h 36m 24s. Estimated total time: 15h 39m 53s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 59s, 500 more iterations: 7h 49m 56s. [2026-03-25 18:28:03,285][__main__][INFO] - Starting iteration 254. [2026-03-25 18:28:03,291][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:28:03,292][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:28:08,621][__main__][INFO] - Number of regex retries in iteration 254: 0 [2026-03-25 18:28:08,622][__main__][INFO] - agents played in iteration 254 are Bob, Alice [2026-03-25 18:28:09,122][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:28:09,187][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:28:09,189][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:28:09,189][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:28:09,885][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:28:10,537][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:28:11,264][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:28:11,988][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:28:12,711][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:28:13,435][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:28:14,158][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:28:14,882][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:28:15,608][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:28:16,334][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:28:17,059][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:28:17,784][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:28:18,508][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:28:19,233][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:28:19,961][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:28:20,686][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:28:21,412][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:28:22,139][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:28:22,864][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:28:23,588][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:28:24,315][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:28:25,041][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:28:25,766][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:28:26,491][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:28:27,216][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:28:27,941][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:28:28,667][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:28:29,392][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:28:30,119][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:28:30,846][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:28:31,571][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:28:32,300][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:28:33,026][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:28:33,753][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:28:34,481][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:28:35,208][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:28:35,935][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:28:36,663][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:28:37,390][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:28:38,117][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:28:38,845][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:28:39,572][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:28:40,298][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:28:41,026][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:28:41,753][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:28:42,481][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:28:43,210][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:28:43,938][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:28:44,897][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:28:45,625][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:28:46,352][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:28:47,079][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:28:47,806][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:28:48,532][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:28:49,261][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:28:49,990][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:28:50,720][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:28:51,449][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:28:52,177][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:28:52,907][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:28:53,634][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:28:54,361][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:28:55,089][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:28:55,815][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:28:56,544][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:28:57,271][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 18:28:58,358][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:28:58,361][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:28:58,363][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:28:59,887][__main__][INFO] - Iteration 255 took 56s (9.42% Gen, 87.88% Train). Generation: 5s, Training: 49s. Estimated remaining time: 11h 38m 54s. Estimated total time: 15h 43m 20s. Time estimates for 10 more iterations: 9m 26s, 100 more iterations: 1h 34m 20s, 500 more iterations: 7h 51m 40s. [2026-03-25 18:28:59,890][__main__][INFO] - Starting iteration 255. [2026-03-25 18:28:59,894][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:28:59,895][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:29:05,132][__main__][INFO] - Number of regex retries in iteration 255: 0 [2026-03-25 18:29:05,133][__main__][INFO] - agents played in iteration 255 are Bob, Alice [2026-03-25 18:29:05,706][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:29:05,773][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:29:05,774][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:29:05,775][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:29:06,465][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:29:07,119][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:29:07,845][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:29:08,569][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:29:09,294][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:29:10,018][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:29:10,744][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:29:11,469][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:29:12,198][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:29:12,922][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:29:13,648][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:29:14,372][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:29:15,098][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:29:15,823][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:29:16,546][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:29:17,270][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:29:17,995][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:29:18,723][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:29:19,450][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:29:20,178][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:29:20,905][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:29:21,632][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:29:22,360][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:29:23,088][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:29:23,816][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:29:24,543][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:29:25,270][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:29:25,997][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:29:26,724][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:29:27,449][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:29:28,176][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:29:28,904][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:29:29,630][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:29:30,358][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:29:31,084][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:29:31,811][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:29:32,539][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:29:33,266][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:29:33,995][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:29:34,722][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:29:35,450][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:29:36,179][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:29:36,906][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:29:37,633][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:29:38,361][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:29:39,089][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:29:39,816][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:29:40,542][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:29:41,512][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:29:42,241][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:29:42,968][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:29:43,696][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:29:44,425][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:29:45,152][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:29:45,879][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:29:46,607][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:29:47,334][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:29:48,063][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:29:48,790][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:29:49,518][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:29:50,246][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:29:50,975][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:29:51,705][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:29:52,433][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:29:53,162][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:29:53,942][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 18:29:55,158][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:29:55,162][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:29:55,164][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:29:56,495][__main__][INFO] - Iteration 256 took 56s (9.25% Gen, 88.39% Train). Generation: 5s, Training: 50s. Estimated remaining time: 11h 38m 0s. Estimated total time: 15h 43m 22s. Time estimates for 10 more iterations: 9m 26s, 100 more iterations: 1h 34m 20s, 500 more iterations: 7h 51m 41s. [2026-03-25 18:29:56,497][__main__][INFO] - Starting iteration 256. [2026-03-25 18:29:56,501][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:29:56,502][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:30:01,867][__main__][INFO] - Number of regex retries in iteration 256: 0 [2026-03-25 18:30:01,868][__main__][INFO] - agents played in iteration 256 are Bob, Alice [2026-03-25 18:30:02,394][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:30:02,461][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:30:02,461][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:30:02,462][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:30:03,147][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:30:03,798][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:30:04,528][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:30:05,250][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:30:05,974][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:30:06,699][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:30:07,425][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:30:08,149][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:30:08,875][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:30:09,600][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:30:10,324][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:30:11,048][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:30:11,772][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:30:12,495][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:30:13,218][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:30:13,945][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:30:14,671][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:30:15,397][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:30:16,123][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:30:16,849][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:30:17,575][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:30:18,301][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:30:19,032][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:30:19,759][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:30:20,488][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:30:21,215][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:30:21,942][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:30:22,670][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:30:23,398][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:30:24,123][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:30:24,851][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:30:25,579][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:30:26,306][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:30:27,034][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:30:27,760][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:30:28,487][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:30:29,213][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:30:29,941][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:30:30,668][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:30:31,396][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:30:32,123][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:30:32,851][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:30:33,578][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:30:34,306][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:30:35,034][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:30:35,764][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:30:36,494][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:30:37,221][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:30:38,220][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:30:38,949][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:30:39,676][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:30:40,404][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:30:46,958][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:30:49,016][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:30:49,738][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:30:50,462][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:30:51,184][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:30:51,908][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:30:52,633][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:30:53,358][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:30:54,080][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:30:54,802][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:30:55,525][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:30:56,249][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:30:56,972][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:30:57,713][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:54 [2026-03-25 18:30:58,932][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:30:58,936][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:30:58,940][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:31:00,470][__main__][INFO] - Iteration 257 took 1m 3s (8.39% Gen, 89.22% Train). Generation: 5s, Training: 57s. Estimated remaining time: 13h 39m 44s. Estimated total time: 17h 46m 10s. Time estimates for 10 more iterations: 10m 39s, 100 more iterations: 1h 46m 37s, 500 more iterations: 8h 53m 5s. [2026-03-25 18:31:00,473][__main__][INFO] - Starting iteration 257. [2026-03-25 18:31:00,477][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:31:00,477][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:31:05,647][__main__][INFO] - Number of regex retries in iteration 257: 0 [2026-03-25 18:31:05,648][__main__][INFO] - agents played in iteration 257 are Bob, Alice [2026-03-25 18:31:06,150][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:31:06,215][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:31:06,217][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:31:06,217][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:31:06,902][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:31:07,553][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:31:08,276][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:31:08,999][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:31:09,720][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:31:10,443][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:31:11,165][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:31:11,886][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:31:12,608][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:31:13,330][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:31:14,055][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:31:14,775][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:31:15,498][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:31:16,220][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:31:16,944][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:31:17,667][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:31:18,390][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:31:19,112][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:31:19,835][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:31:20,558][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:31:21,281][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:31:22,006][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:31:22,730][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:31:23,454][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:31:24,177][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:31:24,900][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:31:25,624][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:31:26,349][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:31:27,072][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:31:27,798][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:31:28,525][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:31:29,247][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:31:29,971][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:31:30,696][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:31:31,420][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:31:32,146][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:31:32,872][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:31:33,597][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:31:34,324][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:31:35,048][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:31:35,773][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:31:36,499][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:31:37,221][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:31:37,947][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:31:38,672][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:31:39,399][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:31:40,125][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:31:40,851][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:31:41,816][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:31:42,543][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:31:43,267][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:31:43,992][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:31:44,718][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:31:45,447][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:31:46,176][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:31:46,904][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:31:47,631][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:31:48,357][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:31:49,086][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:31:49,815][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:31:50,542][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:31:51,270][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:31:51,996][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:31:52,724][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:31:53,451][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:31:54,190][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 18:31:55,563][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:31:55,567][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:31:55,570][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:31:57,011][__main__][INFO] - Iteration 258 took 56s (9.15% Gen, 88.30% Train). Generation: 5s, Training: 49s. Estimated remaining time: 11h 34m 53s. Estimated total time: 15h 42m 15s. Time estimates for 10 more iterations: 9m 25s, 100 more iterations: 1h 34m 13s, 500 more iterations: 7h 51m 7s. [2026-03-25 18:31:57,015][__main__][INFO] - Starting iteration 258. [2026-03-25 18:31:57,020][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:31:57,020][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:32:02,369][__main__][INFO] - Number of regex retries in iteration 258: 0 [2026-03-25 18:32:02,370][__main__][INFO] - agents played in iteration 258 are Bob, Alice [2026-03-25 18:32:02,869][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:32:02,934][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:32:02,935][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:32:02,936][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:32:03,621][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:32:04,274][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:32:05,001][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:32:05,724][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:32:06,447][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:32:07,170][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:32:07,892][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:32:08,617][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:32:09,342][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:32:10,066][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:32:10,792][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:32:11,515][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:32:12,241][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:32:12,966][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:32:13,692][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:32:14,418][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:32:15,143][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:32:15,870][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:32:16,596][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:32:17,320][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:32:18,044][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:32:18,770][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:32:19,494][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:32:20,219][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:32:20,942][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:32:21,668][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:32:22,393][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:32:23,118][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:32:23,843][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:32:24,570][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:32:25,297][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:32:26,022][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:32:26,749][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:32:27,475][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:32:28,201][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:32:28,927][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:32:29,653][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:32:30,380][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:32:31,107][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:32:31,834][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:32:32,563][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:32:33,292][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:32:34,019][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:32:34,748][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:32:35,474][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:32:36,202][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:32:36,929][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:32:37,657][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:32:38,635][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:32:39,362][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:32:40,087][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:32:40,815][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:32:41,541][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:32:42,266][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:32:42,993][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:32:43,721][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:32:44,448][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:32:45,175][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:32:45,902][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:32:46,629][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:32:47,356][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:32:48,083][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:32:48,811][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:32:49,538][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:32:50,266][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:32:51,053][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 18:32:52,416][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:32:52,421][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:32:52,423][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:32:53,877][__main__][INFO] - Iteration 259 took 56s (9.41% Gen, 88.03% Train). Generation: 5s, Training: 50s. Estimated remaining time: 11h 39m 19s. Estimated total time: 15h 47m 39s. Time estimates for 10 more iterations: 9m 28s, 100 more iterations: 1h 34m 45s, 500 more iterations: 7h 53m 49s. [2026-03-25 18:32:53,880][__main__][INFO] - Starting iteration 259. [2026-03-25 18:32:53,884][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:32:53,885][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:33:02,339][__main__][INFO] - Number of regex retries in iteration 259: 0 [2026-03-25 18:33:02,340][__main__][INFO] - agents played in iteration 259 are Bob, Alice [2026-03-25 18:33:02,833][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:33:02,900][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:33:02,901][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:33:02,902][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:33:03,602][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:33:04,253][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:33:04,975][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:33:05,697][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:33:06,419][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:33:07,142][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:33:07,864][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:33:08,586][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:33:09,311][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:33:10,034][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:33:10,757][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:33:11,482][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:33:12,204][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:33:12,931][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:33:13,654][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:33:14,377][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:33:15,102][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:33:15,825][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:33:16,550][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:33:17,274][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:33:18,000][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:33:18,724][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:33:19,450][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:33:20,174][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:33:20,897][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:33:21,621][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:33:22,344][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:33:23,068][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:33:23,794][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:33:24,519][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:33:25,242][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:33:25,966][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:33:26,692][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:33:27,417][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:33:28,143][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:33:28,870][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:33:29,595][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:33:30,321][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:33:31,046][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:33:31,771][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:33:32,496][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:33:33,218][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:33:33,943][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:33:34,668][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:33:35,394][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:33:36,120][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:33:36,847][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:33:37,573][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:33:38,565][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:33:39,293][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:33:40,019][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:33:40,745][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:33:41,472][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:33:42,199][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:33:42,925][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:33:43,653][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:33:44,381][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:33:45,108][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:33:45,835][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:33:46,565][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:33:47,292][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:33:48,020][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:33:48,749][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:33:49,474][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:33:50,202][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:33:50,932][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 18:33:52,298][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:33:52,302][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:33:52,304][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:33:53,694][__main__][INFO] - Iteration 260 took 59s (14.14% Gen, 83.53% Train). Generation: 8s, Training: 49s. Estimated remaining time: 12h 27m 32s. Estimated total time: 16h 36m 52s. Time estimates for 10 more iterations: 9m 58s, 100 more iterations: 1h 39m 41s, 500 more iterations: 8h 18m 26s. [2026-03-25 18:33:53,697][__main__][INFO] - Starting iteration 260. [2026-03-25 18:33:53,701][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:33:53,702][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:33:58,935][__main__][INFO] - Number of regex retries in iteration 260: 0 [2026-03-25 18:33:58,936][__main__][INFO] - agents played in iteration 260 are Bob, Alice [2026-03-25 18:33:59,453][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:33:59,518][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:33:59,519][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:33:59,520][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:34:00,214][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:34:00,866][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:34:01,592][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:34:02,316][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:34:03,038][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:34:03,761][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:34:04,484][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:34:05,206][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:34:05,930][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:34:06,654][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:34:07,379][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:34:08,104][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:34:08,829][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:34:09,553][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:34:10,275][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:34:10,999][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:34:11,723][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:34:12,447][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:34:13,172][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:34:13,898][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:34:14,623][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:34:15,349][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:34:16,074][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:34:16,799][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:34:17,524][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:34:18,248][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:34:18,974][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:34:19,698][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:34:20,422][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:34:21,148][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:34:21,875][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:34:22,600][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:34:23,327][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:34:24,053][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:34:24,779][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:34:25,505][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:34:26,233][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:34:26,959][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:34:27,686][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:34:28,413][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:34:29,139][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:34:29,866][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:34:30,592][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:34:31,319][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:34:32,046][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:34:32,773][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:34:33,500][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:34:34,228][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:34:35,185][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:34:35,914][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:34:36,641][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:34:37,369][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:34:38,097][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:34:38,825][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:34:39,554][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:34:40,283][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:34:41,011][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:34:41,736][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:34:42,463][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:34:43,189][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:34:43,916][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:34:44,644][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:34:45,370][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:34:46,097][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:34:46,824][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:34:47,553][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 18:34:48,706][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:34:48,711][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:34:48,713][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:34:50,197][__main__][INFO] - Iteration 261 took 56s (9.27% Gen, 88.11% Train). Generation: 5s, Training: 49s. Estimated remaining time: 11h 31m 21s. Estimated total time: 15h 41m 37s. Time estimates for 10 more iterations: 9m 24s, 100 more iterations: 1h 34m 9s, 500 more iterations: 7h 50m 48s. [2026-03-25 18:34:50,199][__main__][INFO] - Starting iteration 261. [2026-03-25 18:34:50,204][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:34:50,205][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:34:55,406][__main__][INFO] - Number of regex retries in iteration 261: 0 [2026-03-25 18:34:55,408][__main__][INFO] - agents played in iteration 261 are Bob, Alice [2026-03-25 18:34:55,912][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:34:55,978][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:34:55,979][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:34:55,979][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:34:56,663][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:34:57,314][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:34:58,041][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:34:58,766][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:34:59,488][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:35:00,212][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:35:00,933][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:35:01,658][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:35:02,382][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:35:03,104][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:35:03,829][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:35:04,553][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:35:05,277][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:35:06,001][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:35:06,727][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:35:07,455][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:35:08,182][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:35:08,909][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:35:09,635][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:35:10,363][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:35:11,089][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:35:11,817][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:35:12,542][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:35:13,268][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:35:13,995][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:35:14,721][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:35:15,446][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:35:16,172][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:35:16,898][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:35:17,625][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:35:18,351][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:35:19,076][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:35:19,805][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:35:20,532][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:35:21,258][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:35:21,985][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:35:22,711][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:35:23,439][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:35:24,165][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:35:24,893][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:35:25,619][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:35:26,347][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:35:27,075][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:35:27,803][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:35:28,533][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:35:29,260][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:35:29,987][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:35:30,715][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:35:31,672][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:35:32,399][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:35:33,127][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:35:33,855][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:35:34,583][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:35:35,312][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:35:36,040][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:35:36,768][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:35:37,495][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:35:38,223][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:35:38,953][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:35:39,680][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:35:40,407][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:35:41,134][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:35:41,860][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:35:42,587][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:35:43,314][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:35:44,065][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 18:35:45,289][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:35:45,293][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:35:45,295][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:35:46,734][__main__][INFO] - Iteration 262 took 56s (9.20% Gen, 88.24% Train). Generation: 5s, Training: 49s. Estimated remaining time: 11h 31m 0s. Estimated total time: 15h 42m 12s. Time estimates for 10 more iterations: 9m 25s, 100 more iterations: 1h 34m 13s, 500 more iterations: 7h 51m 6s. [2026-03-25 18:35:46,738][__main__][INFO] - Starting iteration 262. [2026-03-25 18:35:46,744][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:35:46,745][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:35:53,283][__main__][INFO] - Number of regex retries in iteration 262: 0 [2026-03-25 18:35:53,284][__main__][INFO] - agents played in iteration 262 are Bob, Alice [2026-03-25 18:35:53,847][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:35:53,912][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:35:53,913][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:35:53,913][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:35:54,599][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:35:55,250][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:35:55,976][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:35:56,696][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:35:57,421][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:35:58,144][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:35:58,865][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:35:59,588][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:36:00,314][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:36:01,037][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:36:01,763][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:36:02,484][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:36:03,209][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:36:03,931][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:36:04,657][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:36:05,381][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:36:06,105][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:36:06,828][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:36:07,552][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:36:08,275][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:36:09,000][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:36:09,725][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:36:10,448][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:36:11,172][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:36:11,897][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:36:12,621][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:36:13,346][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:36:14,072][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:36:14,795][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:36:15,519][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:36:16,245][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:36:16,971][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:36:17,696][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:36:18,422][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:36:19,150][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:36:19,877][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:36:20,604][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:36:21,330][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:36:22,057][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:36:22,783][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:36:23,509][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:36:24,236][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:36:24,961][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:36:25,688][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:36:26,415][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:36:27,142][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:36:27,869][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:36:28,596][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:36:29,589][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:36:30,328][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:36:31,056][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:36:39,692][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:36:40,414][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:36:41,137][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:36:41,860][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:36:42,583][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:36:43,304][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:36:44,028][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:36:44,752][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:36:45,475][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:36:46,199][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:36:46,924][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:36:47,648][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:36:48,371][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:36:49,095][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:36:49,843][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:55 [2026-03-25 18:36:51,247][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:36:52,451][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:36:52,455][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:36:55,564][__main__][INFO] - Iteration 263 took 1m 8s (9.50% Gen, 85.98% Train). Generation: 6s, Training: 59s. Estimated remaining time: 14h 54m 42s. Estimated total time: 19h 7m 3s. Time estimates for 10 more iterations: 11m 28s, 100 more iterations: 1h 54m 42s, 500 more iterations: 9h 33m 31s. [2026-03-25 18:36:55,567][__main__][INFO] - Starting iteration 263. [2026-03-25 18:36:55,571][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:36:55,572][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:37:09,323][__main__][INFO] - Number of regex retries in iteration 263: 0 [2026-03-25 18:37:09,325][__main__][INFO] - agents played in iteration 263 are Bob, Alice [2026-03-25 18:37:09,866][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:37:09,933][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:37:09,934][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:37:09,935][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:37:10,635][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:37:11,283][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:37:12,005][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:37:12,721][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:37:13,441][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:37:14,160][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:37:14,880][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:37:15,600][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:37:16,319][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:37:17,038][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:37:17,757][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:37:18,477][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:37:19,196][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:37:19,916][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:37:20,638][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:37:21,358][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:37:22,078][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:37:22,799][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:37:23,520][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:37:24,239][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:37:24,959][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:37:25,681][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:37:26,400][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:37:27,120][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:37:27,841][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:37:28,559][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:37:29,280][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:37:30,002][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:37:30,722][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:37:31,443][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:37:32,163][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:37:32,884][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:37:33,605][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:37:34,324][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:37:35,046][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:37:35,767][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:37:36,486][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:37:37,208][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:37:37,930][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:37:38,652][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:37:39,373][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:37:40,095][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:37:40,817][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:37:41,539][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:37:42,262][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:37:42,983][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:37:43,704][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:37:44,428][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:37:45,382][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:37:46,107][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:37:46,828][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:37:47,553][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:37:48,277][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:37:48,999][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:37:49,721][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:37:50,445][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:37:51,166][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:37:51,890][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:37:52,614][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:37:53,337][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:37:54,061][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:37:54,783][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:37:55,506][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:37:56,230][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:37:56,953][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:37:57,691][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 18:37:58,804][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:37:58,808][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:37:58,809][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:38:00,348][__main__][INFO] - Iteration 264 took 1m 4s (21.23% Gen, 76.39% Train). Generation: 13s, Training: 49s. Estimated remaining time: 13h 46m 12s. Estimated total time: 17h 59m 38s. Time estimates for 10 more iterations: 10m 47s, 100 more iterations: 1h 47m 57s, 500 more iterations: 8h 59m 49s. [2026-03-25 18:38:00,351][__main__][INFO] - Starting iteration 264. [2026-03-25 18:38:00,355][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:38:00,356][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:38:05,525][__main__][INFO] - Number of regex retries in iteration 264: 0 [2026-03-25 18:38:05,527][__main__][INFO] - agents played in iteration 264 are Bob, Alice [2026-03-25 18:38:06,024][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:38:06,095][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:38:06,097][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:38:06,097][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:38:06,796][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:38:07,446][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:38:08,169][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:38:08,893][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:38:09,616][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:38:10,333][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:38:11,055][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:38:11,777][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:38:12,497][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:38:13,219][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:38:13,941][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:38:14,662][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:38:15,383][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:38:16,106][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:38:16,827][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:38:17,549][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:38:18,272][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:38:18,994][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:38:19,717][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:38:20,438][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:38:21,160][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:38:21,883][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:38:22,606][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:38:23,329][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:38:24,051][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:38:24,775][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:38:25,499][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:38:26,221][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:38:26,944][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:38:27,669][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:38:28,392][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:38:29,115][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:38:29,838][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:38:30,561][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:38:31,284][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:38:32,008][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:38:32,734][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:38:33,458][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:38:34,181][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:38:34,906][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:38:35,630][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:38:36,354][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:38:37,077][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:38:37,801][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:38:38,527][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:38:39,253][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:38:39,978][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:38:40,704][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:38:41,666][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:38:42,390][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:38:43,113][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:38:43,837][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:38:44,562][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:38:45,287][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:38:46,012][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:38:46,736][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:38:47,462][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:38:48,187][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:38:48,912][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:38:49,638][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:38:50,365][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:38:51,090][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:38:51,815][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:38:52,542][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:38:53,267][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:38:54,029][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 18:38:55,282][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:38:55,287][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:38:55,289][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:38:56,627][__main__][INFO] - Iteration 265 took 56s (9.19% Gen, 88.43% Train). Generation: 5s, Training: 49s. Estimated remaining time: 11h 23m 31s. Estimated total time: 15h 37m 54s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 47s, 500 more iterations: 7h 48m 57s. [2026-03-25 18:38:56,631][__main__][INFO] - Starting iteration 265. [2026-03-25 18:38:56,636][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:38:56,637][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:39:04,185][__main__][INFO] - Number of regex retries in iteration 265: 0 [2026-03-25 18:39:04,186][__main__][INFO] - agents played in iteration 265 are Bob, Alice [2026-03-25 18:39:04,681][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:39:04,747][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:39:04,747][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:39:04,749][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:39:05,450][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:39:06,099][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:39:06,824][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:39:07,546][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:39:08,265][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:39:08,987][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:39:09,710][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:39:10,431][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:39:11,153][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:39:11,875][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:39:12,596][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:39:18,359][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:39:19,080][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:39:19,801][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:39:20,519][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:39:21,239][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:39:21,959][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:39:22,679][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:39:23,400][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:39:24,120][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:39:24,840][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:39:25,562][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:39:26,283][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:39:27,006][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:39:27,727][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:39:28,450][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:39:29,171][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:39:29,891][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:39:30,615][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:39:31,465][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:39:32,192][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:39:32,915][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:39:33,637][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:39:34,357][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:39:35,081][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:39:35,804][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:39:36,525][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:39:37,247][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:39:37,970][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:39:38,693][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:39:39,418][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:39:40,147][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:39:40,870][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:39:41,592][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:39:42,312][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:39:43,034][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:39:43,759][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:39:44,481][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:39:45,490][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:39:46,214][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:39:46,936][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:39:47,658][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:39:48,383][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:39:49,107][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:39:49,830][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:39:50,555][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:39:51,278][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:39:52,000][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:39:52,724][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:39:53,447][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:39:54,171][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:39:54,897][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:39:55,621][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:39:56,345][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:39:57,068][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:39:57,832][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:52 [2026-03-25 18:39:58,989][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:39:58,993][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:39:58,995][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:40:00,402][__main__][INFO] - Iteration 266 took 1m 3s (11.84% Gen, 85.95% Train). Generation: 7s, Training: 54s. Estimated remaining time: 13h 27m 23s. Estimated total time: 17h 42m 49s. Time estimates for 10 more iterations: 10m 37s, 100 more iterations: 1h 46m 16s, 500 more iterations: 8h 51m 24s. [2026-03-25 18:40:00,414][__main__][INFO] - Starting iteration 266. [2026-03-25 18:40:00,450][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:40:00,451][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:40:05,685][__main__][INFO] - Number of regex retries in iteration 266: 0 [2026-03-25 18:40:05,686][__main__][INFO] - agents played in iteration 266 are Bob, Alice [2026-03-25 18:40:06,199][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:40:06,265][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:40:06,266][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:40:06,266][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:40:06,964][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:40:07,613][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:40:08,338][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:40:09,060][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:40:09,780][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:40:10,501][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:40:11,224][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:40:11,946][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:40:12,670][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:40:13,392][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:40:14,112][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:40:14,835][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:40:15,559][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:40:16,282][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:40:17,004][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:40:17,726][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:40:18,450][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:40:19,171][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:40:19,896][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:40:20,620][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:40:21,345][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:40:22,069][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:40:22,792][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:40:23,515][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:40:24,239][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:40:24,963][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:40:25,688][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:40:26,411][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:40:27,136][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:40:27,858][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:40:28,581][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:40:29,307][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:40:30,031][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:40:30,757][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:40:31,482][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:40:32,205][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:40:32,929][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:40:33,653][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:40:34,378][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:40:35,104][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:40:35,829][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:40:36,554][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:40:37,278][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:40:38,002][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:40:38,727][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:40:39,452][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:40:40,179][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:40:40,904][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:40:41,881][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:40:42,608][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:40:43,332][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:40:44,059][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:40:44,784][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:40:45,508][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:40:46,231][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:40:46,955][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:40:47,679][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:40:48,404][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:40:49,131][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:40:49,854][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:40:50,581][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:40:51,306][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:40:52,034][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:40:52,760][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:40:53,487][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:40:54,224][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 18:40:55,490][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:40:55,493][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:40:55,495][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:40:56,864][__main__][INFO] - Iteration 267 took 56s (9.28% Gen, 88.29% Train). Generation: 5s, Training: 49s. Estimated remaining time: 11h 23m 54s. Estimated total time: 15h 40m 16s. Time estimates for 10 more iterations: 9m 24s, 100 more iterations: 1h 34m 1s, 500 more iterations: 7h 50m 8s. [2026-03-25 18:40:56,866][__main__][INFO] - Starting iteration 267. [2026-03-25 18:40:56,871][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:40:56,872][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:41:02,051][__main__][INFO] - Number of regex retries in iteration 267: 0 [2026-03-25 18:41:02,052][__main__][INFO] - agents played in iteration 267 are Bob, Alice [2026-03-25 18:41:02,687][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:41:02,754][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:41:02,755][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:41:02,755][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:41:03,594][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:41:04,330][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:41:05,053][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:41:05,775][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:41:06,498][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:41:07,219][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:41:07,942][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:41:08,666][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:41:09,391][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:41:10,114][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:41:10,838][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:41:11,561][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:41:12,285][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:41:13,009][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:41:13,733][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:41:14,456][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:41:15,180][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:41:15,904][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:41:16,626][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:41:17,350][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:41:18,072][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:41:18,796][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:41:19,520][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:41:20,244][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:41:20,968][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:41:21,691][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:41:22,415][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:41:23,139][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:41:23,864][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:41:24,590][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:41:25,314][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:41:26,041][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:41:26,766][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:41:27,493][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:41:28,219][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:41:28,945][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:41:29,670][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:41:30,396][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:41:31,121][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:41:31,845][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:41:32,570][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:41:33,294][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:41:34,018][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:41:34,743][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:41:35,468][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:41:36,194][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:41:36,921][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:41:37,647][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:41:38,618][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:41:39,345][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:41:40,074][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:41:40,801][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:41:41,528][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:41:42,256][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:41:42,983][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:41:43,710][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:41:44,436][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:41:45,164][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:41:45,889][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:41:46,616][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:41:47,342][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:41:48,069][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:41:48,796][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:41:49,522][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:41:50,250][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:41:51,005][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 18:41:52,163][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:41:52,167][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:41:52,169][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:41:53,554][__main__][INFO] - Iteration 268 took 56s (9.14% Gen, 88.41% Train). Generation: 5s, Training: 50s. Estimated remaining time: 11h 27m 26s. Estimated total time: 15h 44m 46s. Time estimates for 10 more iterations: 9m 26s, 100 more iterations: 1h 34m 28s, 500 more iterations: 7h 52m 23s. [2026-03-25 18:41:53,558][__main__][INFO] - Starting iteration 268. [2026-03-25 18:41:53,572][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:41:53,573][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:41:58,844][__main__][INFO] - Number of regex retries in iteration 268: 0 [2026-03-25 18:41:58,846][__main__][INFO] - agents played in iteration 268 are Bob, Alice [2026-03-25 18:41:59,342][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:41:59,408][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:41:59,410][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:41:59,411][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:42:00,120][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:42:00,773][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:42:01,498][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:42:02,220][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:42:02,943][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:42:03,668][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:42:04,390][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:42:05,112][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:42:05,835][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:42:06,558][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:42:07,282][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:42:08,007][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:42:08,731][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:42:09,455][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:42:10,179][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:42:10,901][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:42:11,625][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:42:12,348][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:42:13,074][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:42:13,797][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:42:14,524][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:42:15,249][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:42:15,972][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:42:16,698][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:42:17,423][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:42:18,147][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:42:18,871][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:42:19,594][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:42:20,317][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:42:21,041][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:42:21,766][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:42:22,492][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:42:23,217][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:42:23,942][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:42:24,667][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:42:25,393][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:42:26,120][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:42:26,845][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:42:27,571][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:42:28,296][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:42:29,023][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:42:29,747][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:42:30,472][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:42:31,197][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:42:31,922][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:42:32,647][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:42:33,372][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:42:34,098][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:42:35,092][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:42:35,820][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:42:36,547][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:42:37,273][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:42:38,000][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:42:38,726][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:42:39,456][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:42:40,182][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:42:40,909][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:42:41,634][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:42:42,361][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:42:43,088][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:42:43,814][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:42:44,541][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:42:45,267][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:42:45,994][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:42:46,721][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:42:47,446][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 18:42:48,500][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:42:48,503][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:42:48,505][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:42:49,964][__main__][INFO] - Iteration 269 took 56s (9.35% Gen, 88.06% Train). Generation: 5s, Training: 49s. Estimated remaining time: 11h 21m 38s. Estimated total time: 15h 39m 54s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 59s, 500 more iterations: 7h 49m 57s. [2026-03-25 18:42:49,967][__main__][INFO] - Starting iteration 269. [2026-03-25 18:42:49,971][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:42:49,972][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:42:53,166][mllm.models.large_language_model_local][WARNING] - Response %A> did not match regex: (|), retry 1/1 [2026-03-25 18:42:57,408][__main__][INFO] - Number of regex retries in iteration 269: 1 [2026-03-25 18:42:57,410][__main__][INFO] - agents played in iteration 269 are Bob, Alice [2026-03-25 18:42:57,993][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:42:58,058][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:42:58,059][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:42:58,059][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2026-03-25 18:42:58,742][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:42:59,393][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:43:00,116][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:43:00,837][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:43:01,558][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:43:02,280][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:43:03,004][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:43:03,726][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:43:04,448][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:43:05,170][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:43:05,892][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:43:06,616][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:43:07,339][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:43:08,061][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:43:08,785][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:43:09,507][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:43:10,231][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:43:10,953][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:43:11,676][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:43:12,400][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:43:13,124][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 256 [2026-03-25 18:43:13,848][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:43:14,570][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:43:15,294][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:43:16,017][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:43:16,740][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:43:17,465][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:43:18,189][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:43:18,914][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:43:19,638][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:43:20,362][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:43:21,084][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:43:21,810][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:43:22,534][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:43:23,259][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:43:23,983][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:43:24,708][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:43:25,434][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:43:26,158][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:43:26,884][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:43:27,612][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 
18:43:28,337][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:43:29,063][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:43:29,789][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:43:30,516][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:43:31,241][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:43:31,967][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:43:32,694][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:43:33,647][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:43:34,375][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:43:35,100][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:43:35,825][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:43:36,551][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:43:37,277][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:43:38,004][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:43:38,729][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:43:39,455][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:43:40,183][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:43:40,908][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:43:41,635][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:43:42,363][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:43:43,090][mllm.training.trainer_common][INFO] - 
Processing mini-batch 244 of 256 [2026-03-25 18:43:43,817][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:43:44,544][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:43:45,271][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:43:46,019][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 18:43:47,260][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:43:47,265][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:43:47,269][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:43:48,640][__main__][INFO] - Iteration 270 took 58s (12.68% Gen, 84.98% Train). Generation: 7s, Training: 49s. Estimated remaining time: 11h 58m 36s. Estimated total time: 16h 17m 50s. Time estimates for 10 more iterations: 9m 46s, 100 more iterations: 1h 37m 47s, 500 more iterations: 8h 8m 55s. [2026-03-25 18:43:48,642][__main__][INFO] - Starting iteration 270. [2026-03-25 18:43:48,646][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:43:48,647][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:43:56,705][__main__][INFO] - Number of regex retries in iteration 270: 0 [2026-03-25 18:43:56,706][__main__][INFO] - agents played in iteration 270 are Bob, Alice [2026-03-25 18:43:57,227][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:43:57,343][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:43:57,344][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:43:57,345][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:43:58,052][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:43:58,703][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:43:59,429][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:44:00,147][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:44:00,870][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:44:01,592][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:44:02,315][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:44:03,036][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:44:03,761][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:44:04,483][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:44:05,207][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:44:05,931][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:44:06,654][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:44:07,377][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:44:08,101][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:44:08,823][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:44:09,547][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:44:10,270][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:44:10,992][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:44:11,716][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:44:12,439][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:44:13,162][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:44:13,889][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:44:14,614][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:44:15,338][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:44:16,064][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:44:16,789][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:44:17,513][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:44:18,235][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:44:18,960][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:44:19,684][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:44:20,408][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:44:21,132][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:44:21,857][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:44:22,584][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:44:23,309][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:44:24,033][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:44:24,759][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:44:25,482][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:44:26,207][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:44:26,932][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:44:27,657][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:44:28,381][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:44:29,107][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:44:29,832][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:44:30,556][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:44:31,281][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:44:32,007][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:44:32,965][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:44:33,690][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:44:34,415][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:44:35,139][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:44:35,864][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:44:36,589][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:44:37,316][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:44:38,041][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:44:38,766][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:44:39,491][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:44:40,217][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:44:40,942][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:44:41,669][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:44:42,396][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:44:43,124][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:44:43,850][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:44:44,577][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:44:45,378][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 18:44:46,534][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:44:46,539][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:44:46,541][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:44:48,068][__main__][INFO] - Iteration 271 took 59s (13.56% Gen, 83.86% Train). Generation: 8s, Training: 49s. Estimated remaining time: 12h 10m 10s. Estimated total time: 16h 30m 24s. Time estimates for 10 more iterations: 9m 54s, 100 more iterations: 1h 39m 2s, 500 more iterations: 8h 15m 12s. [2026-03-25 18:44:48,072][__main__][INFO] - Starting iteration 271. [2026-03-25 18:44:48,076][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:44:48,077][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:44:53,223][__main__][INFO] - Number of regex retries in iteration 271: 0 [2026-03-25 18:44:53,225][__main__][INFO] - agents played in iteration 271 are Bob, Alice [2026-03-25 18:44:53,719][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:44:53,783][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:44:53,785][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:44:53,785][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:44:54,474][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:44:55,126][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:44:55,851][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:44:56,573][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:44:57,296][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:44:58,018][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:44:58,741][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:44:59,465][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:45:00,190][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:45:00,914][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:45:01,636][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:45:02,360][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:45:03,084][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:45:03,809][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:45:04,534][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:45:05,258][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:45:05,980][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:45:06,706][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:45:07,432][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:45:08,157][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:45:08,883][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:45:09,608][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:45:10,332][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:45:11,056][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:45:11,780][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:45:12,505][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:45:13,231][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:45:13,956][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:45:14,681][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:45:15,407][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:45:16,132][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:45:16,856][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:45:17,581][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:45:18,305][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:45:19,031][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:45:19,758][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:45:20,481][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:45:21,207][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:45:21,934][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:45:22,658][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:45:23,383][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:45:24,110][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:45:24,837][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:45:25,561][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:45:26,287][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:45:27,014][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:45:27,742][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:45:28,467][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:45:29,477][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:45:30,203][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:45:30,928][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:45:31,652][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:45:32,380][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:45:33,107][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:45:33,833][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:45:34,560][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:45:35,287][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:45:36,014][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:45:36,742][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:45:37,469][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:45:38,196][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:45:38,923][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:45:39,650][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:45:40,376][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:45:41,103][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:45:41,842][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 18:45:42,955][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:45:42,958][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:45:42,960][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:45:44,328][__main__][INFO] - Iteration 272 took 56s (9.15% Gen, 88.41% Train). Generation: 5s, Training: 49s. Estimated remaining time: 11h 16m 23s. Estimated total time: 15h 37m 33s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 45s, 500 more iterations: 7h 48m 46s. [2026-03-25 18:45:44,331][__main__][INFO] - Starting iteration 272. [2026-03-25 18:45:44,341][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:45:44,342][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:45:49,550][__main__][INFO] - Number of regex retries in iteration 272: 0 [2026-03-25 18:45:49,552][__main__][INFO] - agents played in iteration 272 are Bob, Alice [2026-03-25 18:45:50,051][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:45:50,116][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:45:50,117][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:45:50,118][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:45:50,831][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:45:51,484][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:45:52,208][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:45:52,931][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:45:53,656][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:45:54,379][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:45:55,104][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:45:55,830][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:45:56,555][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:45:57,280][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:45:58,005][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:45:58,730][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:45:59,453][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:46:00,177][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:46:00,901][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:46:01,625][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:46:02,348][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:46:03,074][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:46:03,799][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:46:04,523][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:46:05,246][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:46:05,970][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:46:06,695][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:46:07,419][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:46:08,144][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:46:08,871][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:46:09,600][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:46:10,326][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:46:11,052][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:46:11,778][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:46:12,504][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:46:13,229][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:46:13,953][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:46:14,679][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:46:15,405][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:46:16,130][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:46:16,855][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:46:17,581][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:46:18,305][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:46:19,030][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:46:19,756][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:46:20,482][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:46:21,207][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:46:21,934][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:46:22,662][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:46:23,388][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:46:24,116][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:46:24,842][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:46:25,818][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:46:26,546][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:46:27,271][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:46:27,998][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:46:28,725][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:46:29,452][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:46:30,178][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:46:30,906][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:46:31,634][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:46:32,361][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:46:33,089][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:46:33,817][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:46:34,545][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:46:35,272][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:46:35,999][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:46:36,727][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:46:37,453][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:46:38,192][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 18:46:39,447][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:46:39,451][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:46:39,453][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:46:40,900][__main__][INFO] - Iteration 273 took 56s (9.21% Gen, 88.22% Train). Generation: 5s, Training: 49s. Estimated remaining time: 11h 20m 35s. Estimated total time: 15h 42m 42s. Time estimates for 10 more iterations: 9m 25s, 100 more iterations: 1h 34m 16s, 500 more iterations: 7h 51m 21s. [2026-03-25 18:46:40,904][__main__][INFO] - Starting iteration 273. [2026-03-25 18:46:40,913][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:46:40,915][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:46:47,133][__main__][INFO] - Number of regex retries in iteration 273: 0 [2026-03-25 18:46:47,134][__main__][INFO] - agents played in iteration 273 are Bob, Alice [2026-03-25 18:46:47,628][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:46:47,694][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:46:47,695][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:46:47,696][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:46:48,391][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:46:49,042][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:46:49,767][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:46:50,491][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:46:51,211][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:46:51,935][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:46:52,656][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:46:53,378][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:46:54,101][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:46:54,824][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:46:55,545][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:46:56,270][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:46:56,993][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:46:57,720][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:46:58,445][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:46:59,167][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:46:59,889][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:47:00,612][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:47:01,334][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:47:02,057][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:47:02,782][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:47:03,505][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:47:04,229][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:47:04,951][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:47:05,674][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:47:06,398][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:47:07,123][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:47:07,846][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:47:08,570][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:47:09,295][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:47:10,019][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:47:10,743][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:47:11,467][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:47:12,192][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:47:12,916][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:47:13,639][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:47:14,363][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:47:15,089][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:47:15,813][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:47:16,538][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:47:17,265][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:47:17,989][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:47:18,713][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:47:19,437][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:47:20,162][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:47:20,886][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:47:21,613][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:47:22,338][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:47:23,291][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:47:24,019][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:47:24,743][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:47:25,467][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:47:26,190][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:47:26,914][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:47:27,638][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:47:28,364][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:47:29,090][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:47:29,817][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:47:30,541][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:47:31,266][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:47:31,990][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:47:32,714][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:47:33,438][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:47:34,162][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:47:34,888][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:47:35,700][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 18:47:37,765][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:47:37,768][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:47:37,770][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:47:39,851][__main__][INFO] - Iteration 274 took 58s (10.55% Gen, 85.91% Train). Generation: 6s, Training: 50s. Estimated remaining time: 11h 59m 14s. Estimated total time: 16h 22m 20s. Time estimates for 10 more iterations: 9m 49s, 100 more iterations: 1h 38m 14s, 500 more iterations: 8h 11m 10s. [2026-03-25 18:47:39,854][__main__][INFO] - Starting iteration 274. [2026-03-25 18:47:39,858][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:47:39,859][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:47:47,071][__main__][INFO] - Number of regex retries in iteration 274: 0 [2026-03-25 18:47:47,072][__main__][INFO] - agents played in iteration 274 are Bob, Alice [2026-03-25 18:47:47,573][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:47:47,639][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:47:47,640][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:47:47,640][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:47:48,342][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:47:48,993][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:47:49,716][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:47:50,437][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:47:51,158][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:47:51,879][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:47:52,601][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:47:53,323][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:47:54,044][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:47:54,765][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:47:55,486][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:47:56,210][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:47:56,930][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:47:57,651][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:47:58,375][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:47:59,096][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:47:59,820][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:48:00,543][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:48:01,263][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:48:01,986][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:48:02,710][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:48:03,431][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:48:04,155][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:48:04,876][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:48:05,600][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:48:06,322][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:48:07,044][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:48:07,768][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:48:08,492][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:48:09,216][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:48:09,939][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:48:10,661][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:48:11,383][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:48:12,108][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:48:12,834][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:48:13,558][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:48:14,285][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:48:15,009][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:48:15,735][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:48:16,460][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:48:17,183][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:48:17,908][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:48:18,633][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:48:19,358][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:48:20,083][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:48:20,808][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:48:21,534][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:48:22,259][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:48:23,259][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:48:23,984][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:48:24,708][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:48:25,434][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:48:26,159][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:48:26,883][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:48:27,608][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:48:28,330][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:48:29,056][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:48:29,781][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:48:30,505][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:48:31,229][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:48:31,953][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:48:32,675][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:48:33,401][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:48:34,127][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:48:34,851][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:48:35,585][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 18:48:36,619][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:48:36,623][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:48:36,624][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:48:38,076][__main__][INFO] - Iteration 275 took 58s (12.39% Gen, 85.11% Train). Generation: 7s, Training: 49s. Estimated remaining time: 11h 46m 16s. Estimated total time: 16h 10m 19s. Time estimates for 10 more iterations: 9m 42s, 100 more iterations: 1h 37m 1s, 500 more iterations: 8h 5m 9s. [2026-03-25 18:48:38,079][__main__][INFO] - Starting iteration 275. [2026-03-25 18:48:38,083][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:48:38,084][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:48:43,258][__main__][INFO] - Number of regex retries in iteration 275: 0 [2026-03-25 18:48:43,259][__main__][INFO] - agents played in iteration 275 are Bob, Alice [2026-03-25 18:48:43,760][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:48:43,827][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:48:43,827][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:48:43,828][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:48:44,521][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:48:45,172][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:48:45,895][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:48:46,617][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:48:47,339][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:48:48,063][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:48:48,784][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:48:49,506][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:48:50,229][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:48:50,952][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:48:51,676][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:48:52,401][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:48:53,125][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:48:53,847][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:48:54,570][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:48:55,292][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:48:56,017][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:48:56,741][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:48:57,465][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:48:58,189][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:48:58,910][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:48:59,631][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:49:00,356][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:49:01,081][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:49:01,803][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:49:02,527][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:49:03,251][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:49:03,973][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:49:04,697][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:49:05,419][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:49:06,145][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:49:06,869][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:49:07,596][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:49:08,319][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:49:09,046][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:49:09,771][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:49:10,496][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:49:11,221][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:49:11,946][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:49:12,670][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:49:13,395][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:49:14,119][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:49:14,842][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:49:15,567][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:49:16,291][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:49:17,017][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:49:17,742][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:49:18,468][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:49:19,425][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:49:20,152][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:49:20,876][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:49:21,599][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:49:22,325][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:49:23,050][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:49:23,775][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:49:24,500][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:49:25,227][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:49:25,953][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:49:26,680][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:49:27,405][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:49:28,131][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:49:28,855][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:49:29,580][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:49:30,305][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:49:31,030][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:49:31,783][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 18:49:32,804][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:49:32,808][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:49:32,810][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:49:34,305][__main__][INFO] - Iteration 276 took 56s (9.20% Gen, 88.13% Train). Generation: 5s, Training: 49s. Estimated remaining time: 11h 12m 4s. Estimated total time: 15h 37m 4s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 42s, 500 more iterations: 7h 48m 32s. [2026-03-25 18:49:34,309][__main__][INFO] - Starting iteration 276. [2026-03-25 18:49:34,316][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:49:34,317][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:49:43,304][__main__][INFO] - Number of regex retries in iteration 276: 0 [2026-03-25 18:49:43,305][__main__][INFO] - agents played in iteration 276 are Bob, Alice [2026-03-25 18:49:43,823][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:49:43,889][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:49:43,890][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:49:43,890][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:49:44,582][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:49:45,232][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:49:45,955][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:49:46,676][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:49:47,396][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:49:48,115][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:49:48,836][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:49:49,558][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:49:50,278][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:49:51,000][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:49:51,722][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:49:52,443][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:49:53,168][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:49:53,888][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:49:54,609][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:49:55,331][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:49:56,053][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:49:56,774][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:49:57,496][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:49:58,216][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:49:58,941][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:49:59,664][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:50:00,387][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:50:01,110][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:50:01,831][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:50:02,556][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:50:03,278][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:50:04,000][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:50:04,724][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:50:05,446][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:50:06,168][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:50:06,890][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:50:07,613][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:50:08,337][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:50:09,062][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:50:09,787][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:50:10,509][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:50:11,232][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:50:11,955][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:50:12,679][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:50:13,403][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:50:14,128][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:50:14,854][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:50:15,579][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:50:16,305][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:50:17,029][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:50:17,752][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:50:18,476][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:50:19,437][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:50:20,163][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:50:20,886][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:50:21,612][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:50:22,337][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:50:23,062][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:50:23,789][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:50:24,515][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:50:25,241][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:50:25,965][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:50:26,688][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:50:27,413][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:50:28,137][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:50:28,863][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:50:29,588][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:50:30,312][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:50:31,039][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:50:31,802][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 18:50:33,073][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:50:33,077][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:50:33,079][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:50:34,558][__main__][INFO] - Iteration 277 took 1m 0s (14.92% Gen, 82.62% Train). Generation: 8s, Training: 49s. Estimated remaining time: 12h 18m 5s. Estimated total time: 16h 44m 5s. Time estimates for 10 more iterations: 10m 2s, 100 more iterations: 1h 40m 24s, 500 more iterations: 8h 22m 2s. [2026-03-25 18:50:34,562][__main__][INFO] - Starting iteration 277. [2026-03-25 18:50:34,568][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:50:34,569][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:50:39,720][__main__][INFO] - Number of regex retries in iteration 277: 0 [2026-03-25 18:50:39,722][__main__][INFO] - agents played in iteration 277 are Bob, Alice [2026-03-25 18:50:40,331][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:50:40,396][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:50:40,397][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:50:40,398][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:50:41,152][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:50:41,804][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:50:42,528][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:50:43,250][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:50:43,972][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:50:44,693][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:50:45,415][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:50:46,136][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:50:46,858][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:50:47,583][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:50:48,304][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:50:49,028][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:50:49,751][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:50:50,475][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:50:51,199][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:50:51,923][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:50:52,647][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:50:53,371][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:50:54,095][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:50:54,818][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:50:55,540][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:50:56,263][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:50:56,988][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:50:57,710][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:50:58,433][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:50:59,156][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:50:59,879][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:51:00,602][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:51:01,324][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:51:02,048][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:51:02,772][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:51:03,495][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:51:04,218][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:51:04,940][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:51:05,666][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:51:06,389][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:51:07,114][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:51:07,838][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:51:08,562][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:51:09,287][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:51:10,011][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:51:10,734][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:51:11,458][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:51:12,181][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:51:12,905][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:51:13,631][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:51:14,355][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:51:15,081][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:51:16,083][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:51:16,807][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:51:17,529][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:51:18,254][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:51:18,978][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:51:19,703][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:51:20,428][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:51:21,151][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:51:21,904][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:51:22,602][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:51:23,327][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:51:24,052][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:51:24,776][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:51:25,500][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:51:26,225][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:51:26,948][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:51:27,673][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:51:28,413][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 18:51:29,715][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:51:29,720][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:51:29,722][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:51:31,308][__main__][INFO] - Iteration 278 took 56s (9.08% Gen, 88.12% Train). Generation: 5s, Training: 50s. Estimated remaining time: 11h 18m 45s. Estimated total time: 15h 45m 42s. Time estimates for 10 more iterations: 9m 27s, 100 more iterations: 1h 34m 34s, 500 more iterations: 7h 52m 51s. [2026-03-25 18:51:31,313][__main__][INFO] - Starting iteration 278. [2026-03-25 18:51:31,320][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:51:31,322][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:51:36,615][__main__][INFO] - Number of regex retries in iteration 278: 0 [2026-03-25 18:51:36,617][__main__][INFO] - agents played in iteration 278 are Bob, Alice [2026-03-25 18:51:37,134][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:51:37,200][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:51:37,201][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:51:37,201][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:51:37,895][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:51:38,546][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:51:39,270][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:51:39,994][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:51:40,713][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:51:41,435][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:51:42,158][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:51:42,880][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:51:43,601][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:51:44,323][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:51:45,045][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:51:45,767][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:51:46,490][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:51:47,212][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:51:47,936][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:51:48,660][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:51:49,382][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:51:50,106][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:51:50,830][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:51:51,553][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:51:52,276][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:51:52,999][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:51:53,722][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:51:54,447][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:51:55,171][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:51:55,892][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:51:56,617][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:51:57,341][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:51:58,064][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:51:58,789][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:51:59,514][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:52:00,236][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:52:00,960][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:52:01,684][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:52:02,407][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:52:03,132][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:52:03,856][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:52:04,581][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:52:05,305][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:52:06,029][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:52:06,752][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:52:07,475][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:52:08,199][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:52:08,922][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:52:10,085][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:52:10,809][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:52:11,534][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:52:12,257][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:52:13,210][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:52:13,936][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:52:14,660][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:52:15,383][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:52:16,107][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:52:16,831][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:52:17,554][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:52:18,278][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:52:19,003][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:52:19,728][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:52:20,454][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:52:21,178][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:52:21,903][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:52:22,629][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:52:23,354][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:52:24,079][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:52:24,802][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:52:25,548][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 18:52:26,715][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:52:26,719][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:52:26,720][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:52:28,155][__main__][INFO] - Iteration 279 took 56s (9.32% Gen, 88.15% Train). Generation: 5s, Training: 50s. Estimated remaining time: 11h 19m 24s. Estimated total time: 15h 47m 18s. Time estimates for 10 more iterations: 9m 28s, 100 more iterations: 1h 34m 43s, 500 more iterations: 7h 53m 39s. [2026-03-25 18:52:28,159][__main__][INFO] - Starting iteration 279. [2026-03-25 18:52:28,165][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:52:28,166][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:52:33,451][__main__][INFO] - Number of regex retries in iteration 279: 0 [2026-03-25 18:52:33,453][__main__][INFO] - agents played in iteration 279 are Bob, Alice [2026-03-25 18:52:33,947][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:52:34,013][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:52:34,014][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:52:34,015][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:52:34,708][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:52:35,359][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:52:36,083][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:52:36,803][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:52:37,525][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:52:38,248][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:52:38,971][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:52:39,692][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:52:40,416][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:52:41,138][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:52:47,726][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:52:48,448][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:52:49,169][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:52:49,889][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:52:50,610][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:52:51,331][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:52:52,052][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:52:52,771][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:52:53,494][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:52:54,215][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:52:54,936][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:52:55,656][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:52:56,379][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:52:57,100][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:52:57,820][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:52:58,543][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:52:59,265][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:52:59,986][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:53:00,709][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:53:01,431][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:53:02,150][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:53:02,874][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:53:03,595][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:53:04,317][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:53:05,039][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:53:13,164][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:53:13,886][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:53:14,608][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:53:15,328][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:53:16,049][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:53:16,770][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:53:17,491][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:53:18,211][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:53:18,933][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:53:19,653][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:53:20,373][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:53:21,094][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:53:21,814][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:53:22,788][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:53:23,509][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:53:24,848][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:53:25,569][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:53:26,288][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:53:27,008][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:53:27,728][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:53:28,450][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:53:29,171][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:53:36,771][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:53:37,645][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:53:38,365][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:53:39,084][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:53:39,804][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:53:40,525][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:53:41,244][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:53:41,965][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:53:42,748][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:01:08 [2026-03-25 18:53:44,091][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:53:44,096][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:53:44,098][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:53:45,484][__main__][INFO] - Iteration 280 took 1m 17s (6.84% Gen, 91.36% Train). Generation: 5s, Training: 1m 10s. Estimated remaining time: 16h 59m 31s. Estimated total time: 21h 28m 42s. Time estimates for 10 more iterations: 12m 53s, 100 more iterations: 2h 8m 52s, 500 more iterations: 10h 44m 21s. [2026-03-25 18:53:45,488][__main__][INFO] - Starting iteration 280. [2026-03-25 18:53:45,492][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:53:45,493][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:53:50,677][__main__][INFO] - Number of regex retries in iteration 280: 0 [2026-03-25 18:53:50,678][__main__][INFO] - agents played in iteration 280 are Bob, Alice [2026-03-25 18:53:51,178][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:53:51,244][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:53:51,245][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:53:51,246][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:53:51,935][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:53:52,584][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:53:53,304][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:53:54,022][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:53:54,739][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:53:55,458][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:53:56,177][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:53:56,896][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:53:57,615][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:53:58,334][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:53:59,055][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:53:59,775][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:54:00,495][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:54:01,216][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:54:01,937][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:54:02,657][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:54:03,378][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:54:04,098][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:54:04,817][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:54:05,538][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:54:06,263][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:54:06,985][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:54:07,706][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:54:08,429][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:54:09,151][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:54:09,871][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:54:10,597][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:54:11,321][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:54:12,043][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:54:12,767][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:54:13,491][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:54:14,215][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:54:14,939][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:54:15,664][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:54:16,388][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:54:17,111][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:54:17,836][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:54:18,559][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:54:19,284][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:54:20,009][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:54:20,730][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:54:21,450][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:54:22,171][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:54:22,894][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:54:23,616][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:54:24,339][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:54:25,061][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:54:25,783][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:54:26,762][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:54:27,488][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:54:28,210][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:54:28,931][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:54:29,654][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:54:30,376][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:54:31,098][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:54:31,820][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:54:32,543][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:54:33,263][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:54:33,985][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:54:34,710][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:54:35,431][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:54:36,153][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:54:36,875][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:54:37,596][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:54:38,320][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:54:39,061][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 18:54:40,243][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:54:40,247][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:54:40,252][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:54:41,676][__main__][INFO] - Iteration 281 took 56s (9.23% Gen, 88.23% Train). Generation: 5s, Training: 49s. Estimated remaining time: 11h 6m 19s. Estimated total time: 15h 36m 26s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 38s, 500 more iterations: 7h 48m 13s. [2026-03-25 18:54:41,679][__main__][INFO] - Starting iteration 281. [2026-03-25 18:54:41,684][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:54:41,684][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:54:46,780][__main__][INFO] - Number of regex retries in iteration 281: 0 [2026-03-25 18:54:46,781][__main__][INFO] - agents played in iteration 281 are Bob, Alice [2026-03-25 18:54:47,272][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:54:47,338][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:54:47,339][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:54:47,340][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:54:48,029][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:54:48,680][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:54:49,404][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:54:50,126][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:54:50,846][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:54:51,566][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:54:52,286][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:54:53,005][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:54:53,725][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:54:54,445][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:54:55,167][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:54:55,886][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:54:56,607][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:54:57,328][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:54:58,048][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:54:58,770][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:54:59,491][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:55:00,214][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:55:00,935][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:55:01,655][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:55:02,379][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:55:03,101][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:55:03,823][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:55:04,544][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:55:05,266][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:55:05,989][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:55:06,711][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:55:07,434][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:55:08,155][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:55:08,879][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:55:09,601][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:55:10,322][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:55:11,045][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:55:11,767][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:55:12,492][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:55:13,214][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:55:13,935][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:55:14,658][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:55:15,381][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:55:16,103][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:55:16,826][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:55:17,550][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:55:18,271][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:55:18,993][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:55:19,717][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:55:20,438][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:55:21,161][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:55:21,887][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:55:22,848][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:55:23,575][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:55:24,298][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:55:25,022][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:55:25,745][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:55:26,468][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:55:27,192][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:55:27,915][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:55:28,636][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:55:29,363][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:55:30,087][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:55:30,810][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:55:31,534][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:55:32,259][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:55:32,983][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:55:33,709][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:55:34,433][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:55:35,178][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 18:55:36,351][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:55:36,355][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:55:36,356][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:55:37,751][__main__][INFO] - Iteration 282 took 56s (9.09% Gen, 88.42% Train). Generation: 5s, Training: 49s. Estimated remaining time: 11h 3m 26s. Estimated total time: 15h 34m 30s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 27s, 500 more iterations: 7h 47m 15s. [2026-03-25 18:55:37,755][__main__][INFO] - Starting iteration 282. [2026-03-25 18:55:37,761][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:55:37,763][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:55:42,910][__main__][INFO] - Number of regex retries in iteration 282: 0 [2026-03-25 18:55:42,911][__main__][INFO] - agents played in iteration 282 are Bob, Alice [2026-03-25 18:55:43,412][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:55:43,479][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:55:43,480][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:55:43,481][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:55:44,185][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:55:44,835][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:55:45,560][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:55:46,281][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:55:47,002][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:55:47,722][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:55:48,442][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:55:49,164][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:55:49,888][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:55:50,607][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:55:51,330][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:55:52,051][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:55:52,774][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:55:53,496][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:55:54,216][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:55:54,938][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:55:55,660][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:55:56,382][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:55:57,104][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:55:57,826][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:55:58,547][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:55:59,269][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:55:59,992][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:56:00,716][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:56:01,436][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:56:02,158][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:56:02,882][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:56:03,605][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:56:04,328][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:56:05,051][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:56:05,773][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:56:06,496][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:56:07,219][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:56:07,943][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:56:08,667][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:56:09,389][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:56:10,112][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:56:10,835][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:56:11,558][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:56:12,281][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:56:13,004][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:56:13,727][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:56:14,450][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:56:15,174][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:56:15,896][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:56:16,618][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:56:17,341][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:56:18,066][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:56:19,034][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:56:19,759][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:56:20,480][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:56:21,203][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:56:21,928][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:56:22,651][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:56:23,376][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:56:24,099][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:56:24,822][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:56:25,546][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:56:26,373][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:56:27,097][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:56:27,820][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:56:28,544][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:56:29,267][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:56:29,992][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:56:30,718][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:56:31,507][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 18:56:33,002][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:56:33,010][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:56:33,012][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:56:34,436][__main__][INFO] - Iteration 283 took 56s (9.08% Gen, 88.40% Train). Generation: 5s, Training: 50s. Estimated remaining time: 11h 12m 37s. Estimated total time: 15h 44m 37s. Time estimates for 10 more iterations: 9m 26s, 100 more iterations: 1h 34m 27s, 500 more iterations: 7h 52m 18s. [2026-03-25 18:56:34,440][__main__][INFO] - Starting iteration 283. [2026-03-25 18:56:34,446][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:56:34,446][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:56:39,626][__main__][INFO] - Number of regex retries in iteration 283: 0 [2026-03-25 18:56:39,627][__main__][INFO] - agents played in iteration 283 are Bob, Alice [2026-03-25 18:56:40,123][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:56:40,188][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:56:40,189][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:56:40,190][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:56:40,877][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:56:41,528][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:56:42,250][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:56:42,973][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:56:43,694][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:56:44,413][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:56:45,136][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:56:45,856][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:56:46,579][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:56:47,299][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:56:48,021][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:56:48,743][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:56:49,465][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:56:50,186][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:56:50,906][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:56:51,629][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:56:52,351][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:56:53,074][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:56:53,795][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:56:54,517][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:56:55,240][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:56:55,963][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:56:56,684][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:56:57,405][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:56:58,128][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:56:58,851][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:56:59,573][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:57:00,297][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:57:01,018][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:57:01,740][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:57:02,465][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:57:03,187][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:57:03,911][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:57:04,636][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:57:05,359][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:57:06,082][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:57:06,805][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:57:07,527][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:57:08,251][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:57:08,975][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:57:09,699][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:57:10,424][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:57:11,147][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:57:11,868][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:57:12,593][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:57:13,316][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:57:14,041][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:57:14,766][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:57:15,765][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:57:16,488][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:57:17,211][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:57:17,935][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:57:18,658][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:57:19,381][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:57:20,106][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:57:20,830][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:57:21,555][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:57:22,280][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:57:23,007][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:57:23,730][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:57:24,452][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:57:25,178][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:57:25,902][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:57:26,626][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:57:27,351][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:57:28,113][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 18:57:29,206][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:57:29,209][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:57:29,211][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:57:30,618][__main__][INFO] - Iteration 284 took 56s (9.22% Gen, 88.27% Train). Generation: 5s, Training: 49s. Estimated remaining time: 11h 3m 18s. Estimated total time: 15h 36m 14s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 37s, 500 more iterations: 7h 48m 7s. [2026-03-25 18:57:30,621][__main__][INFO] - Starting iteration 284. [2026-03-25 18:57:30,625][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:57:30,626][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:57:35,741][__main__][INFO] - Number of regex retries in iteration 284: 0 [2026-03-25 18:57:35,742][__main__][INFO] - agents played in iteration 284 are Bob, Alice [2026-03-25 18:57:36,310][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:57:36,377][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:57:36,377][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:57:36,378][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:57:37,068][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:57:37,717][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:57:38,442][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:57:39,163][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:57:39,884][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:57:40,606][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:57:41,327][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:57:42,047][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:57:42,768][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:57:43,490][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:57:44,211][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:57:44,933][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:57:45,657][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:57:46,380][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:57:47,104][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:57:47,826][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:57:48,548][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:57:49,269][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:57:49,992][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:57:50,716][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:57:51,437][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:57:52,159][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:57:52,883][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:57:53,604][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:57:54,327][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:57:55,052][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:57:55,774][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:57:56,496][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:57:57,219][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:57:57,942][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:57:58,665][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:57:59,390][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:58:00,113][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:58:00,835][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:58:01,559][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:58:02,281][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:58:03,004][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:58:03,729][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:58:04,453][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:58:05,176][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:58:05,900][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:58:06,624][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:58:07,346][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:58:08,070][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:58:08,798][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:58:09,522][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:58:10,247][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:58:10,972][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:58:11,936][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:58:12,661][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:58:13,386][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:58:14,110][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:58:14,833][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:58:15,557][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:58:16,279][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:58:17,004][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:58:17,728][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:58:18,455][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:58:19,178][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:58:19,904][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:58:20,628][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:58:21,352][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:58:22,076][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:58:22,800][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:58:23,526][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:58:24,260][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 18:58:25,318][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:58:25,322][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:58:25,324][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:58:26,742][__main__][INFO] - Iteration 285 took 56s (9.12% Gen, 88.35% Train). Generation: 5s, Training: 49s. Estimated remaining time: 11h 1m 26s. Estimated total time: 15h 35m 18s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 31s, 500 more iterations: 7h 47m 39s. [2026-03-25 18:58:26,745][__main__][INFO] - Starting iteration 285. [2026-03-25 18:58:26,749][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:58:26,750][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:58:31,899][__main__][INFO] - Number of regex retries in iteration 285: 0 [2026-03-25 18:58:31,900][__main__][INFO] - agents played in iteration 285 are Bob, Alice [2026-03-25 18:58:32,448][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:58:32,516][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:58:32,517][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:58:32,517][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:58:33,261][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:58:33,912][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:58:34,636][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:58:35,360][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:58:36,082][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:58:36,806][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:58:37,529][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:58:38,251][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:58:38,975][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:58:39,698][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:58:40,423][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:58:41,145][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:58:41,866][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:58:42,590][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:58:43,315][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:58:44,039][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:58:44,764][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:58:45,488][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:58:46,210][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:58:46,932][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:58:47,655][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:58:48,379][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:58:49,102][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:58:49,827][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:58:50,549][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:58:51,272][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:58:51,995][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:58:52,719][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:58:53,444][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:58:54,168][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:58:54,893][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:58:55,617][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:58:56,341][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:58:57,063][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:58:57,789][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:58:58,512][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:58:59,236][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:58:59,960][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:59:00,686][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:59:01,410][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:59:02,136][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:59:02,860][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:59:03,585][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:59:04,309][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:59:05,032][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:59:05,755][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:59:06,481][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:59:07,207][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:59:08,182][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:59:08,909][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:59:09,634][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:59:10,362][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
18:59:11,089][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:59:11,815][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:59:12,541][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:59:13,265][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:59:13,991][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:59:14,715][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:59:15,441][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:59:16,167][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:59:16,891][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:59:17,617][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:59:18,342][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:59:19,068][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:59:19,791][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 18:59:20,584][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 18:59:21,596][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:59:21,600][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:59:21,602][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:59:23,130][__main__][INFO] - Iteration 286 took 56s (9.13% Gen, 88.15% Train). Generation: 5s, Training: 49s. Estimated remaining time: 11h 4m 54s. Estimated total time: 15h 39m 43s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 58s, 500 more iterations: 7h 49m 51s. [2026-03-25 18:59:23,133][__main__][INFO] - Starting iteration 286. [2026-03-25 18:59:23,137][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 18:59:23,138][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:59:28,407][__main__][INFO] - Number of regex retries in iteration 286: 0 [2026-03-25 18:59:28,408][__main__][INFO] - agents played in iteration 286 are Bob, Alice [2026-03-25 18:59:28,940][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:59:29,009][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 18:59:29,011][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 18:59:29,011][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 18:59:29,748][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:59:30,402][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:59:31,130][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:59:31,854][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:59:32,575][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:59:33,296][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:59:34,021][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:59:34,744][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:59:35,467][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:59:36,191][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:59:36,915][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 18:59:37,639][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:59:38,363][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:59:39,088][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:59:39,813][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:59:40,539][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:59:41,262][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:59:41,984][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:59:42,707][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:59:43,430][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:59:44,153][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:59:44,875][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:59:45,599][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:59:46,323][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:59:47,045][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:59:47,770][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:59:48,494][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:59:49,219][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:59:49,944][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:59:50,670][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:59:51,393][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:59:52,117][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 18:59:52,842][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:59:53,566][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:59:54,290][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:59:55,014][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:59:55,736][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:59:56,461][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:59:57,185][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:59:57,910][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:59:58,638][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:59:59,363][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:00:00,089][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:00:00,816][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:00:01,541][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:00:02,269][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:00:02,994][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:00:03,720][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:00:04,713][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:00:05,438][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:00:06,161][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:00:06,885][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:00:07,610][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:00:08,335][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:00:09,062][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:00:09,786][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:00:10,512][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:00:11,237][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:00:11,961][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:00:12,687][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:00:13,411][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:00:14,135][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:00:14,859][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:00:15,583][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:00:16,308][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:00:17,048][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:00:18,260][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:00:18,264][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:00:18,267][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:00:19,730][__main__][INFO] - Iteration 287 took 56s (9.31% Gen, 88.10% Train). Generation: 5s, Training: 49s. Estimated remaining time: 11h 7m 29s. Estimated total time: 15h 43m 14s. Time estimates for 10 more iterations: 9m 25s, 100 more iterations: 1h 34m 19s, 500 more iterations: 7h 51m 37s. [2026-03-25 19:00:19,733][__main__][INFO] - Starting iteration 287. [2026-03-25 19:00:19,738][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 19:00:19,739][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:00:24,939][__main__][INFO] - Number of regex retries in iteration 287: 0 [2026-03-25 19:00:24,940][__main__][INFO] - agents played in iteration 287 are Bob, Alice [2026-03-25 19:00:25,435][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:00:25,501][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:00:25,502][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:00:25,502][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:00:26,191][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:00:26,841][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:00:27,563][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:00:28,284][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:00:29,006][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:00:29,728][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:00:30,447][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:00:31,171][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:00:31,892][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:00:32,614][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:00:33,336][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:00:34,056][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:00:34,780][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:00:35,504][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:00:36,225][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:00:36,946][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:00:37,670][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:00:38,392][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:00:39,115][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:00:39,839][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:00:40,562][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:00:41,288][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:00:42,013][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:00:42,739][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:00:43,463][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:00:44,185][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:00:44,907][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:00:45,629][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:00:46,353][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:00:47,076][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:00:47,798][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:00:48,522][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:00:49,246][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:00:49,970][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:00:50,693][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:00:51,417][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:00:52,138][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:00:52,865][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:00:53,589][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:00:54,312][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:00:55,036][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:00:55,758][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:00:56,482][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:00:57,207][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:00:57,931][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:00:58,656][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:00:59,380][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:01:00,101][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:01:01,060][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:01:01,786][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:01:02,510][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:01:03,235][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:01:03,960][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:01:04,683][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:01:05,406][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:01:06,130][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:01:06,853][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:01:07,578][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:01:08,303][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:01:09,028][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:01:09,754][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:01:10,479][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:01:11,205][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:01:11,929][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:01:12,655][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:01:13,388][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:01:14,424][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:01:14,427][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:01:14,429][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:01:15,815][__main__][INFO] - Iteration 288 took 56s (9.27% Gen, 88.25% Train). Generation: 5s, Training: 49s. Estimated remaining time: 10h 57m 58s. Estimated total time: 15h 34m 39s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 27s, 500 more iterations: 7h 47m 19s. [2026-03-25 19:01:15,817][__main__][INFO] - Starting iteration 288. [2026-03-25 19:01:15,821][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 19:01:15,822][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:01:20,969][__main__][INFO] - Number of regex retries in iteration 288: 0 [2026-03-25 19:01:20,970][__main__][INFO] - agents played in iteration 288 are Bob, Alice [2026-03-25 19:01:21,466][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:01:21,532][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:01:21,533][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:01:21,534][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:01:22,238][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:01:22,888][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:01:23,615][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:01:24,337][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:01:25,059][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:01:25,781][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:01:26,502][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:01:27,224][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:01:27,947][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:01:28,670][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:01:29,391][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:01:30,113][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:01:30,837][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:01:31,561][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:01:32,286][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:01:33,010][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:01:33,732][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:01:34,455][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:01:35,178][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:01:35,901][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:01:36,626][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:01:37,350][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:01:38,075][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:01:38,799][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:01:39,522][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:01:40,246][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:01:40,970][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:01:41,695][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:01:42,419][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:01:43,142][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:01:43,866][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:01:44,590][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:01:45,315][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:01:46,040][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:01:46,765][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:01:47,491][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:01:48,216][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:01:48,942][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:01:49,665][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:01:50,390][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:01:51,116][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:01:51,840][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:01:52,566][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:01:53,292][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:01:54,016][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:01:54,744][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:01:55,469][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:01:56,194][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:01:57,167][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:01:57,892][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:01:58,617][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:01:59,343][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:02:00,069][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:02:00,794][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:02:01,523][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:02:02,249][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:02:02,974][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:02:03,700][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:02:04,426][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:02:05,152][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:02:05,879][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:02:06,606][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:02:07,333][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:02:08,060][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:02:08,786][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:02:09,596][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:02:10,744][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:02:10,748][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:02:10,750][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:02:14,022][__main__][INFO] - Iteration 289 took 58s (8.84% Gen, 85.53% Train). Generation: 5s, Training: 49s. Estimated remaining time: 11h 32m 22s. Estimated total time: 16h 10m 2s. Time estimates for 10 more iterations: 9m 42s, 100 more iterations: 1h 37m 0s, 500 more iterations: 8h 5m 1s. [2026-03-25 19:02:14,026][__main__][INFO] - Starting iteration 289. [2026-03-25 19:02:14,030][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 19:02:14,031][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:02:19,194][__main__][INFO] - Number of regex retries in iteration 289: 0 [2026-03-25 19:02:19,195][__main__][INFO] - agents played in iteration 289 are Bob, Alice [2026-03-25 19:02:19,699][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:02:19,764][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:02:19,766][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:02:19,766][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:02:20,471][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:02:21,123][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:02:21,847][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:02:22,569][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:02:23,290][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:02:24,014][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:02:24,736][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:02:25,460][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:02:26,181][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:02:26,904][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:02:27,625][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:02:28,349][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:02:29,072][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:02:29,793][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:02:30,518][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:02:31,240][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:02:31,963][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:02:32,688][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:02:33,411][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:02:34,134][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:02:34,856][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:02:35,579][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:02:36,304][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:02:37,027][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:02:37,752][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:02:38,476][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:02:39,201][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:02:39,925][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:02:40,647][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:02:41,371][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:02:42,095][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:02:42,819][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:02:43,543][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:02:44,268][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:02:44,992][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:02:45,717][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:02:46,441][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:02:47,163][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:02:47,886][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:02:48,611][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:02:49,336][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:02:50,060][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:02:50,786][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:02:51,511][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:02:52,233][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:02:52,958][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:02:53,682][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:02:54,407][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:02:55,407][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:02:56,133][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:02:56,857][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:02:57,583][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:02:58,307][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:02:59,032][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:02:59,758][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:03:00,483][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:03:01,209][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:03:01,931][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:03:02,659][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:03:03,385][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:03:04,110][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:03:04,836][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:03:05,564][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:03:06,290][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:03:07,019][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:03:07,761][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:03:08,851][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:03:08,855][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:03:08,857][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:03:10,158][__main__][INFO] - Iteration 290 took 56s (9.20% Gen, 88.48% Train). Generation: 5s, Training: 49s. Estimated remaining time: 10h 56m 53s. Estimated total time: 15h 35m 29s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 32s, 500 more iterations: 7h 47m 44s. [2026-03-25 19:03:10,160][__main__][INFO] - Starting iteration 290. [2026-03-25 19:03:10,165][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 19:03:10,165][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:03:15,475][__main__][INFO] - Number of regex retries in iteration 290: 0 [2026-03-25 19:03:15,477][__main__][INFO] - agents played in iteration 290 are Bob, Alice [2026-03-25 19:03:15,980][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:03:16,046][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:03:16,049][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:03:16,050][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:03:16,744][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:03:17,394][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:03:18,119][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:03:18,843][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:03:19,566][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:03:20,289][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:03:21,010][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:03:21,732][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:03:22,455][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:03:23,177][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:03:23,901][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:03:24,624][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:03:25,346][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:03:26,069][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:03:26,793][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:03:27,514][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:03:28,240][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:03:28,963][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:03:29,686][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:03:30,409][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:03:31,133][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:03:31,856][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:03:32,580][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:03:33,305][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:03:34,029][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:03:34,755][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:03:35,479][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:03:36,202][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:03:36,925][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:03:37,648][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:03:38,371][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:03:39,097][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:03:39,821][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:03:40,544][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:03:41,269][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:03:41,994][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:03:42,718][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:03:43,442][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:03:44,165][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:03:44,889][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:03:45,613][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:03:46,337][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:03:47,062][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:03:47,788][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:03:48,513][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:03:49,238][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:03:49,964][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:03:50,688][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:03:51,642][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:03:52,368][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:03:53,094][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:03:53,818][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:03:54,542][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:03:55,266][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:03:55,989][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:03:56,714][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:03:57,440][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:03:58,164][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:03:58,890][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:03:59,615][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:04:00,341][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:04:01,067][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:04:01,791][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:04:02,515][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:04:03,240][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:04:03,976][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:04:05,083][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:04:05,087][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:04:05,089][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:04:06,498][__main__][INFO] - Iteration 291 took 56s (9.43% Gen, 88.07% Train). Generation: 5s, Training: 49s. Estimated remaining time: 10h 59m 23s. Estimated total time: 15h 38m 55s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 53s, 500 more iterations: 7h 49m 27s. [2026-03-25 19:04:06,501][__main__][INFO] - Starting iteration 291. [2026-03-25 19:04:06,505][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 19:04:06,506][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:04:11,652][__main__][INFO] - Number of regex retries in iteration 291: 0 [2026-03-25 19:04:11,653][__main__][INFO] - agents played in iteration 291 are Bob, Alice [2026-03-25 19:04:12,188][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:04:12,255][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:04:12,256][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:04:12,256][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:04:12,961][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:04:13,612][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:04:14,337][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:04:15,059][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:04:15,780][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:04:16,502][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:04:17,225][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:04:17,948][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:04:18,672][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:04:19,394][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:04:20,115][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:04:20,839][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:04:21,563][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:04:22,287][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:04:23,011][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:04:23,734][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:04:24,458][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:04:25,180][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:04:25,903][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:04:26,627][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:04:27,351][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:04:28,076][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:04:28,800][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:04:29,525][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:04:30,249][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:04:30,973][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:04:31,697][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:04:32,421][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:04:33,144][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:04:33,869][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:04:34,593][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:04:35,317][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:04:36,042][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:04:36,768][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:04:37,491][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:04:38,214][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:04:38,939][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:04:39,662][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:04:40,385][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:04:41,110][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:04:41,835][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:04:42,560][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:04:43,284][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:04:44,010][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:04:44,734][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:04:45,459][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:04:46,184][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:04:46,906][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:04:47,870][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:04:48,594][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:04:49,318][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:04:50,043][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:04:50,769][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:04:51,495][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:04:52,222][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:04:52,947][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:04:53,673][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:04:54,399][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:04:55,125][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:04:55,852][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:04:56,577][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:04:57,303][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:04:58,029][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:04:58,755][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:04:59,480][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:05:00,239][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:05:01,278][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:05:03,138][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:05:03,143][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:05:08,087][__main__][INFO] - Iteration 292 took 1m 1s (8.36% Gen, 83.61% Train). Generation: 5s, Training: 51s. Estimated remaining time: 12h 25m 50s. Estimated total time: 17h 6m 23s. Time estimates for 10 more iterations: 10m 15s, 100 more iterations: 1h 42m 38s, 500 more iterations: 8h 33m 11s. [2026-03-25 19:05:08,091][__main__][INFO] - Starting iteration 292. [2026-03-25 19:05:08,096][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 19:05:08,097][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:05:13,438][__main__][INFO] - Number of regex retries in iteration 292: 0 [2026-03-25 19:05:13,439][__main__][INFO] - agents played in iteration 292 are Bob, Alice [2026-03-25 19:05:14,036][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:05:14,102][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:05:14,104][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:05:14,104][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:05:14,799][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:05:15,451][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:05:16,175][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:05:16,896][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:05:17,617][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:05:18,336][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:05:19,056][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:05:19,777][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:05:20,497][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:05:21,219][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:05:21,939][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:05:22,661][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:05:23,382][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:05:24,105][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:05:24,824][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:05:25,545][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:05:26,267][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:05:26,989][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:05:27,710][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:05:28,432][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:05:29,153][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:05:29,875][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:05:30,598][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:05:31,321][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:05:32,042][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:05:32,764][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:05:33,487][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:05:34,210][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:05:34,930][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:05:35,651][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:05:36,374][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:05:37,097][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:05:37,821][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:05:38,542][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:05:39,266][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:05:39,988][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:05:40,713][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:05:41,435][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:05:42,159][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:05:42,885][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:05:43,607][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:05:44,330][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:05:45,053][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:05:45,776][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:05:46,503][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:05:47,230][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:05:47,955][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:05:48,680][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:05:49,729][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:05:50,454][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:05:51,179][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:05:51,904][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:05:52,629][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:05:53,355][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:05:54,080][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:05:54,805][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:05:55,530][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:05:56,254][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:05:56,978][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:05:57,703][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:05:58,428][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:05:59,153][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:05:59,879][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:06:00,604][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:06:01,330][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:06:02,071][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:06:03,091][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:06:03,095][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:06:03,097][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:06:04,443][__main__][INFO] - Iteration 293 took 56s (9.48% Gen, 88.13% Train). Generation: 5s, Training: 49s. Estimated remaining time: 10h 57m 38s. Estimated total time: 15h 39m 8s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 54s, 500 more iterations: 7h 49m 34s. [2026-03-25 19:06:04,446][__main__][INFO] - Starting iteration 293. [2026-03-25 19:06:04,450][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 19:06:04,450][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:06:12,284][__main__][INFO] - Number of regex retries in iteration 293: 0 [2026-03-25 19:06:12,285][__main__][INFO] - agents played in iteration 293 are Bob, Alice [2026-03-25 19:06:12,808][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:06:12,874][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:06:12,875][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:06:12,875][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:06:13,565][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:06:14,216][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:06:14,944][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:06:15,665][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:06:16,386][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:06:17,109][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:06:17,830][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:06:18,554][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:06:19,277][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:06:19,998][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:06:20,719][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:06:21,442][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:06:22,163][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:06:22,885][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:06:23,606][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:06:24,328][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:06:25,049][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:06:25,771][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:06:26,494][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:06:27,215][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:06:27,937][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:06:28,658][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:06:29,381][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:06:30,103][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:06:30,827][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:06:31,550][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:06:32,275][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:06:33,000][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:06:33,722][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:06:34,445][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:06:35,167][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:06:35,890][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:06:36,615][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:06:37,339][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:06:38,061][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:06:38,783][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:06:39,508][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:06:40,230][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:06:40,954][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:06:41,678][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:06:42,401][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:06:43,124][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:06:43,848][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:06:44,572][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:06:45,296][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:06:46,022][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:06:46,747][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:06:47,472][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:06:48,432][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:06:49,155][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:06:49,878][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:06:50,602][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:06:51,325][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:06:52,050][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:06:52,776][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:06:53,501][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:06:54,226][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:06:54,950][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:06:55,678][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:06:56,403][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:06:57,127][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:06:57,851][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:06:58,573][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:06:59,298][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:07:00,022][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:07:00,771][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:07:01,746][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:07:01,748][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:07:01,750][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:07:03,330][__main__][INFO] - Iteration 294 took 58s (13.31% Gen, 84.01% Train). Generation: 7s, Training: 49s. Estimated remaining time: 11h 38m 53s. Estimated total time: 16h 21m 22s. Time estimates for 10 more iterations: 9m 48s, 100 more iterations: 1h 38m 8s, 500 more iterations: 8h 10m 41s. [2026-03-25 19:07:03,332][__main__][INFO] - Starting iteration 294. [2026-03-25 19:07:03,336][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 19:07:03,337][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:07:08,639][__main__][INFO] - Number of regex retries in iteration 294: 0 [2026-03-25 19:07:08,640][__main__][INFO] - agents played in iteration 294 are Bob, Alice [2026-03-25 19:07:09,144][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:07:09,211][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:07:09,212][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:07:09,213][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:07:09,929][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:07:10,579][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:07:11,305][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:07:12,024][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:07:12,745][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:07:13,468][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:07:14,191][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:07:14,914][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:07:15,635][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:07:16,357][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:07:17,082][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:07:17,804][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:07:18,528][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:07:19,250][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:07:19,973][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:07:20,696][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:07:21,419][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:07:22,142][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:07:22,863][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:07:23,586][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:07:24,310][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:07:25,034][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:07:25,758][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:07:26,480][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:07:27,202][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:07:27,926][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:07:28,652][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:07:29,376][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:07:30,101][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:07:30,824][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:07:31,547][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:07:32,270][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:07:32,995][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:07:33,719][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:07:34,442][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:07:35,164][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:07:35,890][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:07:36,614][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:07:37,340][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:07:38,064][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:07:38,787][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:07:39,511][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:07:40,234][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:07:40,960][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:07:41,684][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:07:42,407][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:07:43,131][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:07:43,854][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:07:44,812][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:07:45,538][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:07:46,264][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:07:46,987][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:07:47,714][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:07:48,439][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:07:49,163][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:07:49,887][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:07:50,610][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:07:51,335][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:07:52,061][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:07:52,785][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:07:53,511][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:07:54,235][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:07:54,961][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:07:55,685][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:07:56,411][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:07:57,165][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:07:58,278][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:07:58,282][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:07:58,283][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:07:59,712][__main__][INFO] - Iteration 295 took 56s (9.41% Gen, 88.06% Train). Generation: 5s, Training: 49s. Estimated remaining time: 10h 56m 12s. Estimated total time: 15h 39m 37s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 57s, 500 more iterations: 7h 49m 48s. [2026-03-25 19:07:59,715][__main__][INFO] - Starting iteration 295. [2026-03-25 19:07:59,719][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 19:07:59,720][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:08:04,872][__main__][INFO] - Number of regex retries in iteration 295: 0 [2026-03-25 19:08:04,873][__main__][INFO] - agents played in iteration 295 are Bob, Alice [2026-03-25 19:08:05,364][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:08:05,429][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:08:05,430][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:08:05,431][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:08:06,169][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:08:06,820][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:08:07,545][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:08:08,268][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:08:08,989][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:08:09,710][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:08:10,431][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:08:11,155][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:08:11,878][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:08:12,601][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:08:13,324][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:08:14,045][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:08:14,768][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:08:15,490][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:08:16,213][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:08:16,935][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:08:17,657][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:08:18,379][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:08:19,103][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:08:19,827][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:08:20,550][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:08:21,274][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:08:21,997][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:08:22,720][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:08:23,443][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:08:24,166][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:08:24,891][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:08:25,615][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:08:26,339][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:08:27,062][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:08:27,785][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:08:28,509][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:08:29,234][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:08:29,959][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:08:30,683][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:08:31,407][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:08:32,131][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:08:32,856][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:08:33,581][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:08:34,306][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:08:35,031][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:08:35,754][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:08:36,480][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:08:37,203][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:08:37,928][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:08:38,653][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:08:39,378][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:08:40,103][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:08:41,096][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:08:41,823][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:08:42,553][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:08:43,278][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:08:44,004][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:08:44,729][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:08:45,452][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:08:46,178][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:08:46,900][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:08:47,626][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:08:48,352][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:08:49,076][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:08:49,801][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:08:50,526][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:08:51,249][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:08:51,974][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:08:52,699][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:08:53,455][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:08:54,425][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:08:54,428][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:08:54,430][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:08:55,855][__main__][INFO] - Iteration 296 took 56s (9.18% Gen, 88.28% Train). Generation: 5s, Training: 49s. Estimated remaining time: 10h 51m 16s. Estimated total time: 15h 35m 37s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 33s, 500 more iterations: 7h 47m 48s. [2026-03-25 19:08:55,857][__main__][INFO] - Starting iteration 296. [2026-03-25 19:08:55,862][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 19:08:55,863][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:09:02,498][__main__][INFO] - Number of regex retries in iteration 296: 0 [2026-03-25 19:09:02,500][__main__][INFO] - agents played in iteration 296 are Bob, Alice [2026-03-25 19:09:02,994][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:09:03,061][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:09:03,062][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:09:03,063][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:09:03,765][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:09:04,416][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:09:05,139][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:09:05,859][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:09:06,581][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:09:07,303][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:09:08,023][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:09:08,746][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:09:09,470][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:09:10,193][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:09:10,917][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:09:11,638][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:09:12,361][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:09:13,082][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:09:13,805][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:09:14,531][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:09:15,252][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:09:15,975][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:09:16,697][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:09:17,420][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:09:18,146][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:09:18,867][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:09:19,592][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:09:20,316][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:09:21,037][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:09:21,763][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:09:22,488][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:09:23,212][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:09:23,936][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:09:24,661][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:09:25,385][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:09:26,109][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:09:26,833][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:09:27,556][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:09:28,279][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:09:29,003][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:09:29,726][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:09:30,451][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:09:31,175][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:09:31,899][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:09:32,624][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:09:33,346][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:09:34,070][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:09:34,794][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:09:35,518][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:09:36,245][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:09:36,969][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:09:37,692][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:09:38,647][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:09:39,373][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:09:40,097][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:09:40,821][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:09:41,546][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:09:42,271][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:09:42,996][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:09:43,720][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:09:44,446][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:09:45,172][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:09:45,897][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:09:46,623][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:09:47,347][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:09:48,070][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:09:48,797][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:09:49,523][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:09:50,248][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:09:50,992][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:09:52,182][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:09:52,186][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:09:52,188][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:09:53,629][__main__][INFO] - Iteration 297 took 57s (11.49% Gen, 86.01% Train). Generation: 6s, Training: 49s. Estimated remaining time: 11h 17m 29s. Estimated total time: 16h 2m 49s. Time estimates for 10 more iterations: 9m 37s, 100 more iterations: 1h 36m 16s, 500 more iterations: 8h 1m 24s. [2026-03-25 19:09:53,631][__main__][INFO] - Starting iteration 297. [2026-03-25 19:09:53,635][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 19:09:53,636][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:09:58,970][__main__][INFO] - Number of regex retries in iteration 297: 0 [2026-03-25 19:09:58,971][__main__][INFO] - agents played in iteration 297 are Bob, Alice [2026-03-25 19:09:59,477][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:09:59,544][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:09:59,545][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:09:59,545][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:10:00,259][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:10:00,909][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:10:01,635][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:10:02,356][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:10:03,079][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:10:03,802][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:10:04,523][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:10:05,245][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:10:05,968][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:10:06,690][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:10:07,413][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:10:08,134][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:10:08,857][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:10:09,581][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:10:10,302][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:10:11,027][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:10:11,752][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:10:12,472][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:10:13,198][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:10:13,920][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:10:14,643][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:10:15,367][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:10:16,091][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:10:16,814][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:10:17,537][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:10:18,260][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:10:18,983][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:10:19,707][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:10:20,431][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:10:21,156][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:10:21,879][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:10:22,602][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:10:23,326][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:10:24,051][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:10:24,774][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:10:25,500][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:10:26,225][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:10:26,949][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:10:27,674][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:10:28,399][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:10:29,124][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:10:29,847][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:10:30,572][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:10:31,296][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:10:32,020][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:10:32,745][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:10:33,469][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:10:34,194][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:10:35,153][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:10:35,878][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:10:36,602][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:10:37,327][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:10:38,052][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:10:38,779][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:10:39,503][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:10:40,227][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:10:40,952][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:10:41,678][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:10:42,403][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:10:43,128][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:10:43,853][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:10:44,577][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:10:45,301][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:10:46,027][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:10:46,752][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:10:47,484][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:10:48,476][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:10:48,479][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:10:48,480][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:10:49,826][__main__][INFO] - Iteration 298 took 56s (9.50% Gen, 88.10% Train). Generation: 5s, Training: 49s. Estimated remaining time: 10h 50m 17s. Estimated total time: 15h 36m 32s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 39s, 500 more iterations: 7h 48m 16s. [2026-03-25 19:10:49,828][__main__][INFO] - Starting iteration 298. [2026-03-25 19:10:49,833][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 19:10:49,834][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:10:55,184][__main__][INFO] - Number of regex retries in iteration 298: 0 [2026-03-25 19:10:55,185][__main__][INFO] - agents played in iteration 298 are Bob, Alice [2026-03-25 19:10:55,678][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:10:55,745][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:10:55,745][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:10:55,746][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:10:56,451][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:10:57,103][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:10:57,826][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:10:58,546][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:10:59,270][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:10:59,993][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:11:00,714][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:11:01,437][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:11:02,158][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:11:02,882][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:11:03,607][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:11:04,331][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:11:05,053][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:11:05,775][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:11:06,502][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:11:07,224][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:11:07,947][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:11:08,671][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:11:09,396][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:11:10,119][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:11:10,841][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:11:11,566][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:11:12,288][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:11:13,012][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:11:13,735][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:11:14,460][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:11:15,184][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:11:15,909][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:11:16,631][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:11:17,355][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:11:18,079][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:11:18,804][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:11:19,526][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:11:20,250][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:11:20,976][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:11:21,701][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:11:22,427][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:11:23,152][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:11:23,878][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:11:24,605][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:11:25,328][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:11:26,052][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:11:26,776][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:11:27,499][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:11:28,225][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:11:28,948][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:11:29,672][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:11:30,398][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:11:31,450][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:11:32,175][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:11:32,899][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:11:33,625][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:11:34,350][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:11:35,075][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:11:35,802][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:11:36,527][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:11:37,253][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:11:37,981][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:11:38,708][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:11:39,434][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:11:40,159][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:11:40,884][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:11:41,609][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:11:42,335][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:11:43,061][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:11:43,802][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:11:44,864][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:11:44,868][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:11:44,870][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:11:46,245][__main__][INFO] - Iteration 299 took 56s (9.49% Gen, 88.07% Train). Generation: 5s, Training: 49s. Estimated remaining time: 10h 53m 2s. Estimated total time: 15h 40m 14s. Time estimates for 10 more iterations: 9m 24s, 100 more iterations: 1h 34m 1s, 500 more iterations: 7h 50m 7s. [2026-03-25 19:11:46,247][__main__][INFO] - Starting iteration 299. [2026-03-25 19:11:46,251][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 19:11:46,252][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:11:51,417][__main__][INFO] - Number of regex retries in iteration 299: 0 [2026-03-25 19:11:51,419][__main__][INFO] - agents played in iteration 299 are Bob, Alice [2026-03-25 19:11:51,994][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:11:52,059][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:11:52,060][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:11:52,060][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:11:52,783][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:11:53,435][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:11:54,159][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:11:54,882][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:11:55,605][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:11:56,326][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:11:57,049][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:11:57,772][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:11:58,493][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:11:59,216][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:11:59,938][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:12:00,660][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:12:01,384][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:12:02,107][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:12:02,832][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:12:03,553][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:12:04,275][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:12:04,998][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:12:05,722][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:12:06,446][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:12:07,168][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:12:07,892][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:12:08,617][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:12:09,341][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:12:10,067][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:12:10,791][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:12:11,517][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:12:12,241][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:12:12,965][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:12:13,687][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:12:14,412][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:12:15,135][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:12:15,859][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:12:16,584][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:12:17,308][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:12:18,033][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:12:18,757][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:12:19,478][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:12:20,204][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:12:20,928][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:12:21,654][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:12:22,378][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:12:23,105][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:12:23,829][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:12:24,554][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:12:25,279][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:12:26,003][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:12:26,729][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:12:27,681][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:12:28,406][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:12:29,132][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:12:29,856][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:12:30,582][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:12:31,306][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:12:32,032][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:12:32,756][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:12:33,481][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:12:34,204][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:12:34,929][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:12:35,653][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:12:36,378][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:12:37,104][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:12:37,829][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:12:38,555][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:12:39,283][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:12:40,013][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:12:41,092][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:12:41,096][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:12:41,097][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:12:42,570][__main__][INFO] - Iteration 300 took 56s (9.17% Gen, 88.21% Train). Generation: 5s, Training: 49s. Estimated remaining time: 10h 50m 32s. Estimated total time: 15h 38m 40s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 52s, 500 more iterations: 7h 49m 20s. [2026-03-25 19:12:42,572][__main__][INFO] - Starting iteration 300. [2026-03-25 19:12:42,576][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2026-03-25 19:12:42,578][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:12:47,920][__main__][INFO] - Number of regex retries in iteration 300: 0 [2026-03-25 19:12:47,921][__main__][INFO] - agents played in iteration 300 are Bob, Alice [2026-03-25 19:12:48,466][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:12:48,531][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:12:48,532][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:12:48,533][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:12:49,225][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:12:49,876][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:12:50,600][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:12:51,322][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:12:52,045][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:12:52,766][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:12:53,488][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:12:54,211][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:12:54,934][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:12:55,656][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:12:56,379][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:12:57,102][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:12:57,825][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:12:58,548][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:12:59,269][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:12:59,991][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:13:00,715][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:13:01,439][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:13:02,162][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:13:02,884][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:13:03,606][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:13:04,330][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:13:05,053][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:13:05,777][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:13:06,500][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:13:07,222][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:13:07,948][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:13:08,672][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:13:09,397][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:13:10,118][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:13:10,843][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:13:11,568][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:13:12,292][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:13:13,016][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:13:13,739][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:13:14,464][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:13:15,188][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:13:15,912][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:13:16,637][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:13:17,361][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:13:18,087][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:13:18,809][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:13:19,535][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:13:20,259][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:13:20,985][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:13:21,710][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:13:22,437][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:13:23,162][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:13:24,117][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:13:24,843][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:13:25,567][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:13:26,292][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:13:27,017][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:13:27,742][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:13:28,468][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:13:29,194][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:13:29,919][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:13:30,646][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:13:31,370][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:13:32,096][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:13:32,819][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:13:33,544][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:13:34,268][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:13:34,993][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:13:35,719][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:13:36,454][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:13:37,681][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:13:37,684][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:13:37,686][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:13:42,483][__main__][INFO] - Iteration 301 took 59s (8.92% Gen, 83.07% Train). Generation: 5s, Training: 49s. Estimated remaining time: 11h 49m 21s. Estimated total time: 16h 38m 29s. Time estimates for 10 more iterations: 9m 59s, 100 more iterations: 1h 39m 50s, 500 more iterations: 8h 19m 14s. [2026-03-25 19:13:42,486][__main__][INFO] - Starting iteration 301. [2026-03-25 19:13:42,491][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:13:42,492][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:13:47,642][__main__][INFO] - Number of regex retries in iteration 301: 0 [2026-03-25 19:13:47,644][__main__][INFO] - agents played in iteration 301 are Bob, Alice [2026-03-25 19:13:48,152][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:13:48,220][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:13:48,220][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:13:48,221][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:13:48,910][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:13:49,561][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:13:50,285][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:13:51,004][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:13:51,725][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:13:52,447][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:13:53,167][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:13:53,889][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:13:54,612][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:13:55,333][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:13:56,055][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:13:56,776][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:13:57,497][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:13:58,219][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:13:58,942][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:13:59,663][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:14:00,385][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:14:01,106][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:14:01,831][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:14:02,553][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:14:03,274][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:14:03,996][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:14:04,719][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:14:05,444][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:14:06,167][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:14:06,891][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:14:07,614][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:14:08,337][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:14:09,062][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:14:09,785][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:14:10,509][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:14:11,232][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:14:11,955][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:14:12,679][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:14:13,403][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:14:14,128][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:14:14,852][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:14:15,575][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:14:16,299][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:14:17,022][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:14:17,747][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:14:18,471][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:14:19,194][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:14:19,918][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:14:20,643][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:14:21,370][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:14:22,095][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:14:22,820][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:14:23,872][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:14:24,599][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:14:25,322][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:14:26,046][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:14:26,769][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:14:27,495][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:14:28,219][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:14:28,944][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:14:29,667][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:14:30,393][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:14:31,117][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:14:31,844][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:14:32,569][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:14:33,296][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:14:34,022][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:14:34,747][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:14:35,473][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:14:36,243][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:14:37,410][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:14:37,413][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:14:37,415][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:14:38,800][__main__][INFO] - Iteration 302 took 56s (9.15% Gen, 88.39% Train). Generation: 5s, Training: 49s. Estimated remaining time: 10h 48m 27s. Estimated total time: 15h 38m 31s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 51s, 500 more iterations: 7h 49m 15s. [2026-03-25 19:14:38,803][__main__][INFO] - Starting iteration 302. [2026-03-25 19:14:38,807][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:14:38,808][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:14:44,041][__main__][INFO] - Number of regex retries in iteration 302: 0 [2026-03-25 19:14:44,042][__main__][INFO] - agents played in iteration 302 are Bob, Alice [2026-03-25 19:14:44,543][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:14:44,609][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:14:44,610][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:14:44,610][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:14:45,300][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:14:45,951][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:14:46,675][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:14:47,399][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:14:48,121][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:14:48,845][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:14:49,568][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:14:50,289][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:14:51,013][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:14:51,737][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:14:52,461][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:14:53,186][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:14:53,909][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:14:54,633][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:14:55,355][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:14:56,080][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:14:56,802][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:14:57,526][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:14:58,251][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:14:58,975][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:14:59,700][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:15:00,426][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:15:01,149][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:15:01,874][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:15:02,598][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:15:03,323][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:15:04,047][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:15:04,772][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:15:05,497][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:15:06,222][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:15:06,949][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:15:07,674][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:15:08,399][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:15:09,126][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:15:09,850][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:15:10,574][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:15:11,298][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:15:12,022][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:15:12,748][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:15:15,516][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:15:16,240][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:15:16,963][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:15:17,689][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:15:18,412][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:15:19,138][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:15:19,861][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:15:21,332][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:15:22,059][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:15:23,019][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:15:23,743][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:15:24,467][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:15:25,193][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:15:25,917][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:15:26,643][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:15:27,368][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:15:28,092][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:15:28,816][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:15:29,540][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:15:30,265][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:15:30,992][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:15:31,717][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:15:32,442][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:15:33,168][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:15:33,893][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:15:34,618][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:15:35,351][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:50 [2026-03-25 19:15:36,604][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:15:36,608][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:15:36,610][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:15:38,020][__main__][INFO] - Iteration 303 took 59s (8.84% Gen, 88.78% Train). Generation: 5s, Training: 52s. Estimated remaining time: 11h 35m 51s. Estimated total time: 16h 26m 54s. Time estimates for 10 more iterations: 9m 52s, 100 more iterations: 1h 38m 41s, 500 more iterations: 8h 13m 27s. [2026-03-25 19:15:38,026][__main__][INFO] - Starting iteration 303. [2026-03-25 19:15:38,034][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:15:38,034][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:15:43,247][__main__][INFO] - Number of regex retries in iteration 303: 0 [2026-03-25 19:15:43,248][__main__][INFO] - agents played in iteration 303 are Bob, Alice [2026-03-25 19:15:43,761][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:15:43,828][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:15:43,829][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:15:43,830][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:15:44,531][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:15:45,182][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:15:45,905][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:15:46,628][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:15:47,351][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:15:48,074][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:15:48,796][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:15:49,518][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:15:50,239][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:15:50,963][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:15:51,686][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:15:52,407][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:15:53,130][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:15:53,852][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:15:54,575][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:15:55,299][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:15:56,021][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:15:56,744][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:15:57,467][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:15:58,190][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:15:58,913][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:15:59,637][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:16:00,360][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:16:01,085][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:16:01,807][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:16:02,531][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:16:03,254][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:16:03,979][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:16:04,702][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:16:05,426][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:16:06,150][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:16:06,875][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:16:07,600][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:16:08,323][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:16:09,047][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:16:09,770][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:16:10,495][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:16:11,219][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:16:11,943][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:16:12,670][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:16:13,395][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:16:14,121][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:16:14,846][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:16:15,570][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:16:16,296][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:16:17,020][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:16:17,746][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:16:18,470][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:16:19,433][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:16:20,160][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:16:20,886][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:16:21,611][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:16:22,338][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:16:23,062][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:16:23,787][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:16:24,512][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:16:25,235][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:16:25,961][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:16:26,686][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:16:27,412][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:16:28,138][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:16:28,864][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:16:29,588][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:16:30,312][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:16:31,038][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:16:31,790][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:16:32,918][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:16:32,923][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:16:32,925][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:16:34,290][__main__][INFO] - Iteration 304 took 56s (9.27% Gen, 88.30% Train). Generation: 5s, Training: 49s. Estimated remaining time: 10h 45m 39s. Estimated total time: 15h 37m 39s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 45s, 500 more iterations: 7h 48m 49s. [2026-03-25 19:16:34,293][__main__][INFO] - Starting iteration 304. [2026-03-25 19:16:34,298][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:16:34,299][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:16:39,791][__main__][INFO] - Number of regex retries in iteration 304: 0 [2026-03-25 19:16:39,793][__main__][INFO] - agents played in iteration 304 are Bob, Alice [2026-03-25 19:16:40,293][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:16:40,360][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:16:40,361][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:16:40,362][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:16:41,084][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:16:41,736][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:16:42,460][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:16:43,183][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:16:43,905][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:16:44,628][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:16:45,350][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:16:46,073][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:16:46,797][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:16:47,519][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:16:48,243][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:16:48,965][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:16:49,688][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:16:50,411][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:16:51,134][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:16:51,858][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:16:52,582][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:16:53,306][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:16:54,031][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:16:54,754][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:16:55,478][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:16:56,201][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:16:56,926][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:16:57,649][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:16:58,375][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:16:59,100][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:16:59,824][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:17:00,549][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:17:01,275][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:17:01,999][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:17:02,724][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:17:03,446][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:17:04,171][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:17:04,895][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:17:05,618][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:17:06,345][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:17:07,069][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:17:07,796][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:17:08,522][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:17:09,248][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:17:09,974][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:17:10,700][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:17:11,426][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:17:12,152][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:17:12,878][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:17:13,604][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:17:14,332][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:17:15,059][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:17:16,109][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:17:16,834][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:17:17,559][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:17:18,285][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:17:19,012][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:17:19,735][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:17:20,463][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:17:21,189][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:17:21,915][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:17:22,643][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:17:23,369][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:17:24,095][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:17:24,822][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:17:25,549][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:17:26,275][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:17:27,004][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:17:27,731][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:17:28,488][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:17:29,623][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:17:29,627][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:17:29,629][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:17:31,074][__main__][INFO] - Iteration 305 took 56s (9.68% Gen, 87.77% Train). Generation: 5s, Training: 49s. Estimated remaining time: 10h 53m 22s. Estimated total time: 15h 46m 18s. Time estimates for 10 more iterations: 9m 27s, 100 more iterations: 1h 34m 37s, 500 more iterations: 7h 53m 9s. [2026-03-25 19:17:31,078][__main__][INFO] - Starting iteration 305. [2026-03-25 19:17:31,084][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:17:31,084][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:17:37,658][__main__][INFO] - Number of regex retries in iteration 305: 0 [2026-03-25 19:17:37,659][__main__][INFO] - agents played in iteration 305 are Bob, Alice [2026-03-25 19:17:38,197][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:17:38,263][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:17:38,264][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:17:38,265][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:17:38,986][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:17:39,638][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:17:40,361][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:17:41,084][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:17:41,804][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:17:42,527][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:17:43,249][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:17:43,971][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:17:44,694][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:17:45,416][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:17:46,139][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:17:46,863][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:17:47,587][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:17:48,309][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:17:49,032][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:17:49,754][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:17:50,477][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:17:51,202][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:17:51,926][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:17:52,649][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:17:53,372][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:17:54,096][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:17:54,819][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:17:55,544][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:17:56,268][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:17:56,994][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:17:57,716][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:17:58,440][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:17:59,163][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:17:59,887][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:18:00,611][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:18:01,337][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:18:02,061][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:18:02,786][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:18:03,512][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:18:04,239][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:18:04,963][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:18:05,689][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:18:06,415][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:18:07,141][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:18:07,864][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:18:08,589][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:18:09,313][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:18:10,036][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:18:10,762][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:18:11,485][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:18:12,211][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:18:12,936][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:18:13,898][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:18:14,625][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:18:15,349][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:18:16,074][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:18:16,800][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:18:17,525][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:18:18,252][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:18:18,974][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:18:19,698][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:18:20,425][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:18:21,150][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:18:21,876][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:18:22,601][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:18:23,327][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:18:24,052][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:18:24,778][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:18:25,503][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:18:26,227][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:18:27,490][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:18:27,494][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:18:27,496][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:18:29,005][__main__][INFO] - Iteration 306 took 57s (11.35% Gen, 86.04% Train). Generation: 6s, Training: 49s. Estimated remaining time: 11h 11m 28s. Estimated total time: 16h 5m 23s. Time estimates for 10 more iterations: 9m 39s, 100 more iterations: 1h 36m 32s, 500 more iterations: 8h 2m 41s. [2026-03-25 19:18:29,009][__main__][INFO] - Starting iteration 306. [2026-03-25 19:18:29,015][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:18:29,016][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:18:34,130][__main__][INFO] - Number of regex retries in iteration 306: 0 [2026-03-25 19:18:34,131][__main__][INFO] - agents played in iteration 306 are Bob, Alice [2026-03-25 19:18:34,708][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:18:34,774][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:18:34,775][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:18:34,775][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:18:35,461][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:18:36,112][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:18:36,837][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:18:37,559][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:18:38,281][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:18:39,005][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:18:39,730][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:18:40,451][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:18:41,173][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:18:41,897][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:18:42,621][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:18:43,344][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:18:44,066][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:18:44,789][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:18:45,513][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:18:46,238][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:18:46,962][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:18:47,684][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:18:48,409][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:18:49,136][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:18:49,862][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:18:50,587][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:18:51,311][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:18:52,035][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:18:52,759][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:18:53,482][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:18:54,206][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:18:54,931][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:18:55,656][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:18:56,382][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:18:57,108][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:18:57,833][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:18:58,561][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:18:59,287][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:19:00,013][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:19:00,741][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:19:01,466][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:19:02,192][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:19:02,920][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:19:03,645][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:19:04,371][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:19:05,096][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:19:05,822][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:19:06,548][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:19:07,275][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:19:08,001][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:19:08,728][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:19:09,454][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:19:10,411][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:19:11,136][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:19:11,860][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:19:12,586][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:19:13,312][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:19:14,040][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:19:14,766][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:19:15,493][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:19:16,219][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:19:16,948][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:19:17,676][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:19:18,412][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:19:19,139][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:19:19,865][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:19:20,595][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:19:21,321][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:19:22,049][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:19:22,819][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:19:23,881][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:19:23,885][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:19:23,886][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:19:25,409][__main__][INFO] - Iteration 307 took 56s (9.07% Gen, 88.22% Train). Generation: 5s, Training: 49s. Estimated remaining time: 10h 45m 5s. Estimated total time: 15h 39m 56s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 59s, 500 more iterations: 7h 49m 58s. [2026-03-25 19:19:25,412][__main__][INFO] - Starting iteration 307. [2026-03-25 19:19:25,417][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:19:25,418][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:19:30,683][__main__][INFO] - Number of regex retries in iteration 307: 0 [2026-03-25 19:19:30,684][__main__][INFO] - agents played in iteration 307 are Bob, Alice [2026-03-25 19:19:31,219][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:19:31,284][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:19:31,285][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:19:31,286][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:19:31,969][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:19:32,621][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:19:33,347][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:19:34,069][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:19:34,791][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:19:35,514][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:19:36,239][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:19:36,963][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:19:37,685][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:19:38,409][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:19:39,134][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:19:39,858][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:19:40,584][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:19:41,311][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:19:42,037][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:19:42,761][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:19:43,483][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:19:44,208][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:19:44,932][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:19:45,656][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:19:46,382][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:19:47,106][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:19:47,831][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:19:48,556][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:19:49,279][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:19:50,003][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:19:50,728][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:19:51,454][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:19:52,180][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:19:52,905][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:19:53,631][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:19:54,354][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:19:55,081][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:19:55,805][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:19:56,531][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:19:57,257][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:19:57,981][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:19:58,706][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:19:59,430][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:20:00,156][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:20:00,882][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:20:01,607][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:20:02,333][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:20:03,058][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:20:03,784][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:20:04,509][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:20:05,235][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:20:05,960][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:20:06,974][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:20:07,700][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:20:08,424][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:20:09,151][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:20:09,878][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:20:10,602][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:20:11,328][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:20:12,054][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:20:12,778][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:20:13,504][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:20:14,230][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:20:14,956][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:20:15,683][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:20:16,407][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:20:17,135][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:20:17,861][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:20:18,588][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:20:19,341][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:20:20,390][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:20:20,395][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:20:20,397][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:20:21,763][__main__][INFO] - Iteration 308 took 56s (9.35% Gen, 88.22% Train). Generation: 5s, Training: 49s. Estimated remaining time: 10h 43m 21s. Estimated total time: 15h 39m 8s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 54s, 500 more iterations: 7h 49m 34s. [2026-03-25 19:20:21,766][__main__][INFO] - Starting iteration 308. [2026-03-25 19:20:21,792][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:20:21,793][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:20:27,063][__main__][INFO] - Number of regex retries in iteration 308: 0 [2026-03-25 19:20:27,064][__main__][INFO] - agents played in iteration 308 are Bob, Alice [2026-03-25 19:20:27,563][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:20:27,629][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:20:27,630][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:20:27,631][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:20:28,323][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:20:28,976][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:20:29,701][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:20:30,423][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:20:31,145][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:20:31,867][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:20:32,590][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:20:33,314][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:20:34,037][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:20:34,762][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:20:35,485][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:20:36,207][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:20:36,930][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:20:37,654][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:20:38,378][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:20:39,104][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:20:39,828][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:20:40,551][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:20:41,273][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:20:41,996][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:20:42,721][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:20:43,445][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:20:44,171][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:20:44,896][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:20:45,621][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:20:46,344][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:20:47,068][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:20:47,792][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:20:48,516][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:20:49,240][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:20:49,967][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:20:50,691][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:20:51,417][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:20:52,141][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:20:52,864][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:20:53,590][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:20:54,315][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:20:55,038][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:20:55,765][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:20:56,490][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:20:57,215][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:20:57,940][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:20:58,663][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:20:59,388][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:21:00,113][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:21:00,838][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:21:01,565][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:21:02,290][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:21:03,254][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:21:04,080][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:21:04,806][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:21:05,531][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:21:06,257][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:21:06,983][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:21:07,708][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:21:08,434][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:21:09,159][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:21:09,885][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:21:10,611][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:21:11,336][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:21:12,062][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:21:12,788][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:21:13,513][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:21:14,239][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:21:14,965][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:21:15,741][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:21:16,778][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:21:16,782][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:21:16,784][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:21:18,434][__main__][INFO] - Iteration 309 took 56s (9.31% Gen, 87.77% Train). Generation: 5s, Training: 49s. Estimated remaining time: 10h 47m 19s. Estimated total time: 15h 44m 3s. Time estimates for 10 more iterations: 9m 26s, 100 more iterations: 1h 34m 24s, 500 more iterations: 7h 52m 1s. [2026-03-25 19:21:18,438][__main__][INFO] - Starting iteration 309. [2026-03-25 19:21:18,443][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:21:18,444][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:21:23,726][__main__][INFO] - Number of regex retries in iteration 309: 0 [2026-03-25 19:21:23,727][__main__][INFO] - agents played in iteration 309 are Bob, Alice [2026-03-25 19:21:24,228][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:21:24,298][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:21:24,299][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:21:24,300][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:21:25,000][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:21:25,653][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:21:26,378][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:21:27,101][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:21:27,824][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:21:28,547][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:21:29,269][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:21:29,992][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:21:30,716][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:21:31,442][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:21:32,167][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:21:32,891][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:21:33,614][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:21:34,339][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:21:35,065][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:21:35,791][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:21:36,515][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:21:37,238][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:21:37,962][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:21:38,687][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:21:39,412][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:21:40,137][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:21:40,863][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:21:41,588][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:21:42,313][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:21:43,040][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:21:43,766][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:21:44,493][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:21:45,218][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:21:45,944][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:21:46,671][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:21:47,397][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:21:48,123][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:21:48,850][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:21:49,577][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:21:50,302][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:21:51,029][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:21:51,755][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:21:52,480][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:21:53,206][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:21:53,932][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:21:54,658][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:21:55,382][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:21:56,107][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:21:56,833][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:21:57,558][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:21:58,285][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:21:59,011][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:21:59,977][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:22:00,703][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:22:01,431][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:22:02,158][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:22:02,886][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:22:03,613][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:22:04,339][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:22:05,067][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:22:05,792][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:22:06,517][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:22:07,243][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:22:07,971][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:22:08,697][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:22:09,424][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:22:10,152][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:22:10,878][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:22:11,604][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:22:12,353][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:22:13,347][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:22:13,349][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:22:13,351][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:22:14,730][__main__][INFO] - Iteration 310 took 56s (9.39% Gen, 88.16% Train). Generation: 5s, Training: 49s. Estimated remaining time: 10h 40m 28s. Estimated total time: 15h 38m 8s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 48s, 500 more iterations: 7h 49m 4s. [2026-03-25 19:22:14,734][__main__][INFO] - Starting iteration 310. [2026-03-25 19:22:14,750][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:22:14,751][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:22:19,929][__main__][INFO] - Number of regex retries in iteration 310: 0 [2026-03-25 19:22:19,930][__main__][INFO] - agents played in iteration 310 are Bob, Alice [2026-03-25 19:22:20,428][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:22:20,493][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:22:20,494][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:22:20,494][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:22:21,189][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:22:21,841][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:22:22,567][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:22:23,290][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:22:24,013][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:22:24,737][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:22:25,462][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:22:26,186][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:22:26,910][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:22:27,634][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:22:28,358][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:22:29,082][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:22:29,807][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:22:30,530][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:22:31,255][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:22:31,981][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:22:32,708][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:22:33,433][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:22:34,159][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:22:34,884][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:22:35,611][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:22:36,337][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:22:37,062][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:22:37,791][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:22:38,517][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:22:39,244][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:22:39,970][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:22:40,697][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:22:41,424][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:22:42,152][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:22:42,877][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:22:43,604][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:22:44,331][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:22:45,057][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:22:45,784][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:22:46,510][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:22:47,238][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:22:47,965][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:22:48,691][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:22:49,416][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:22:50,143][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:22:50,868][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:22:51,595][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:22:52,322][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:22:53,049][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:22:53,775][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:22:54,501][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:22:55,227][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:22:56,257][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:22:56,983][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:22:57,711][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:22:58,435][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:22:59,163][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:22:59,890][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:23:00,615][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:23:01,344][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:23:02,071][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:23:02,798][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:23:03,524][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:23:04,250][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:23:04,978][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:23:05,705][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:23:06,432][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:23:07,161][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:23:07,888][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:23:08,673][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:23:09,742][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:23:09,746][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:23:09,747][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:23:11,182][__main__][INFO] - Iteration 311 took 56s (9.18% Gen, 88.27% Train). Generation: 5s, Training: 49s. Estimated remaining time: 10h 41m 58s. Estimated total time: 15h 40m 35s. Time estimates for 10 more iterations: 9m 24s, 100 more iterations: 1h 34m 3s, 500 more iterations: 7h 50m 17s. [2026-03-25 19:23:11,185][__main__][INFO] - Starting iteration 311. [2026-03-25 19:23:11,189][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:23:11,190][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:23:16,523][__main__][INFO] - Number of regex retries in iteration 311: 0 [2026-03-25 19:23:16,525][__main__][INFO] - agents played in iteration 311 are Bob, Alice [2026-03-25 19:23:17,033][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:23:17,099][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:23:17,100][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:23:17,101][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:23:17,794][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:23:18,444][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:23:19,172][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:23:19,894][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:23:20,618][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:23:21,342][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:23:22,066][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:23:22,791][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:23:23,516][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:23:24,240][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:23:24,963][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:23:25,686][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:23:26,410][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:23:27,133][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:23:27,858][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:23:28,583][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:23:29,306][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:23:30,031][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:23:30,757][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:23:31,481][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:23:32,207][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:23:32,933][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:23:33,657][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:23:34,384][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:23:35,110][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:23:35,835][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:23:36,561][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:23:37,288][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:23:38,014][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:23:38,740][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:23:39,467][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:23:40,192][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:23:40,918][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:23:41,644][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:23:42,370][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:23:43,096][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:23:43,823][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:23:44,550][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:23:45,277][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:23:46,003][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:23:46,730][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:23:47,457][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:23:48,183][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:23:48,909][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:23:49,636][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:23:50,363][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:23:51,089][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:23:51,815][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:23:52,778][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:23:53,505][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:23:54,233][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:23:54,960][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:23:55,687][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:23:56,414][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:23:57,141][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:23:57,867][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:23:58,594][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:23:59,323][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:24:00,050][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:24:00,779][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:24:01,506][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:24:02,233][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:24:02,959][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:24:03,687][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:24:04,413][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:24:05,144][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:24:06,120][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:24:06,122][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:24:06,124][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:24:07,531][__main__][INFO] - Iteration 312 took 56s (9.47% Gen, 88.03% Train). Generation: 5s, Training: 49s. Estimated remaining time: 10h 39m 30s. Estimated total time: 15h 39m 3s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 54s, 500 more iterations: 7h 49m 31s. [2026-03-25 19:24:07,533][__main__][INFO] - Starting iteration 312. [2026-03-25 19:24:07,538][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:24:07,538][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:24:12,786][__main__][INFO] - Number of regex retries in iteration 312: 0 [2026-03-25 19:24:12,787][__main__][INFO] - agents played in iteration 312 are Bob, Alice [2026-03-25 19:24:13,305][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:24:13,372][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:24:13,373][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:24:13,373][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:24:14,087][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:24:14,746][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:24:15,472][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:24:16,194][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:24:16,917][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:24:17,640][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:24:18,364][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:24:19,089][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:24:19,813][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:24:20,536][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:24:21,260][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:24:21,985][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:24:22,711][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:24:23,434][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:24:24,159][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:24:24,882][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:24:25,608][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:24:26,333][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:24:27,058][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:24:27,783][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:24:28,506][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:24:29,230][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:24:29,954][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:24:30,679][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:24:31,404][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:24:32,130][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:24:32,857][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:24:33,583][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:24:34,310][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:24:35,035][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:24:39,925][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:24:40,649][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:24:41,372][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:24:42,096][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:24:42,818][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:24:43,541][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:24:44,265][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:24:44,989][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:24:45,713][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:24:46,438][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:24:47,160][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:24:47,883][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:24:48,607][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:24:49,332][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:24:50,056][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:24:50,781][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:24:51,506][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:24:52,231][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:24:53,190][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:24:53,915][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:24:54,639][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:24:55,364][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:24:56,089][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:24:56,813][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:24:57,538][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:24:58,264][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:24:58,990][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:24:59,716][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:25:00,443][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:25:01,169][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:25:01,895][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:25:02,621][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:25:03,347][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:25:04,073][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:25:04,799][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:25:05,529][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:51 [2026-03-25 19:25:06,623][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:25:06,626][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:25:06,627][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:25:08,060][__main__][INFO] - Iteration 313 took 1m 0s (8.67% Gen, 88.95% Train). Generation: 5s, Training: 53s. Estimated remaining time: 11h 48m 11s. Estimated total time: 16h 48m 44s. Time estimates for 10 more iterations: 10m 5s, 100 more iterations: 1h 40m 52s, 500 more iterations: 8h 24m 22s. [2026-03-25 19:25:08,065][__main__][INFO] - Starting iteration 313. [2026-03-25 19:25:08,070][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:25:08,071][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:25:13,418][__main__][INFO] - Number of regex retries in iteration 313: 0 [2026-03-25 19:25:13,419][__main__][INFO] - agents played in iteration 313 are Bob, Alice [2026-03-25 19:25:13,991][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:25:14,058][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:25:14,059][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:25:14,060][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:25:14,761][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:25:15,412][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:25:16,139][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:25:16,859][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:25:17,582][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:25:18,304][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:25:19,028][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:25:19,751][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:25:20,473][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:25:21,196][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:25:21,920][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:25:22,643][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:25:23,369][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:25:24,091][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:25:24,814][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:25:25,537][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:25:26,261][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:25:26,985][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:25:27,709][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:25:28,432][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:25:29,156][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:25:29,880][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:25:30,605][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:25:31,327][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:25:32,052][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:25:32,777][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:25:33,501][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:25:34,226][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:25:34,949][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:25:35,673][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:25:36,397][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:25:37,121][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:25:37,845][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:25:38,570][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:25:39,296][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:25:40,022][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:25:40,749][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:25:41,475][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:25:42,201][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:25:42,927][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:25:43,654][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:25:44,382][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:25:45,107][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:25:45,834][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:25:46,560][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:25:47,287][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:25:48,012][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:25:48,738][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:25:49,752][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:25:50,479][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:25:51,203][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:25:51,929][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:25:52,654][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:25:53,379][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:25:54,105][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:25:54,831][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:25:55,558][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:25:56,285][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:25:57,010][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:25:57,737][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:25:58,463][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:25:59,190][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:25:59,916][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:26:00,641][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:26:01,368][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:26:02,112][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:26:03,186][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:26:03,191][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:26:03,193][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:26:04,538][__main__][INFO] - Iteration 314 took 56s (9.47% Gen, 88.14% Train). Generation: 5s, Training: 49s. Estimated remaining time: 10h 39m 39s. Estimated total time: 15h 41m 9s. Time estimates for 10 more iterations: 9m 24s, 100 more iterations: 1h 34m 6s, 500 more iterations: 7h 50m 34s. [2026-03-25 19:26:04,541][__main__][INFO] - Starting iteration 314. [2026-03-25 19:26:04,545][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:26:04,546][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:26:09,713][__main__][INFO] - Number of regex retries in iteration 314: 0 [2026-03-25 19:26:09,714][__main__][INFO] - agents played in iteration 314 are Bob, Alice [2026-03-25 19:26:10,259][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:26:10,325][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:26:10,328][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:26:10,329][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:26:11,017][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:26:11,669][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:26:12,396][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:26:13,119][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:26:13,843][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:26:14,568][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:26:15,293][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:26:16,017][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:26:16,743][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:26:17,468][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:26:18,193][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:26:18,915][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:26:19,638][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:26:20,364][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:26:21,089][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:26:21,813][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:26:22,539][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:26:23,265][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:26:23,991][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:26:24,716][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:26:25,441][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:26:26,165][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:26:26,889][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:26:27,615][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:26:28,342][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:26:29,068][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:26:29,795][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:26:30,520][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:26:31,246][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:26:31,972][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:26:32,699][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:26:33,424][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:26:34,152][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:26:34,877][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:26:35,602][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:26:36,327][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:26:37,053][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:26:37,779][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:26:38,505][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:26:39,232][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:26:39,958][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:26:40,685][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:26:41,410][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:26:42,138][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:26:42,864][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:26:43,591][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:26:44,320][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:26:45,046][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:26:46,025][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:26:46,752][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:26:47,477][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:26:48,203][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:26:48,929][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:26:49,654][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:26:50,381][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:26:51,109][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:26:51,834][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:26:52,560][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:26:53,288][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:26:54,013][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:26:54,741][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:26:55,467][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:26:56,193][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:26:56,918][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:26:57,645][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:26:58,387][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:26:59,453][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:26:59,456][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:26:59,459][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:27:00,860][__main__][INFO] - Iteration 315 took 56s (9.18% Gen, 88.33% Train). Generation: 5s, Training: 49s. Estimated remaining time: 10h 36m 10s. Estimated total time: 15h 38m 36s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 51s, 500 more iterations: 7h 49m 18s. [2026-03-25 19:27:00,863][__main__][INFO] - Starting iteration 315. [2026-03-25 19:27:00,868][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:27:00,869][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:27:06,119][__main__][INFO] - Number of regex retries in iteration 315: 0 [2026-03-25 19:27:06,120][__main__][INFO] - agents played in iteration 315 are Bob, Alice [2026-03-25 19:27:06,624][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:27:06,688][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:27:06,689][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:27:06,689][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:27:07,382][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:27:08,033][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:27:08,762][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:27:09,484][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:27:10,210][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:27:10,935][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:27:11,660][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:27:12,384][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:27:13,107][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:27:13,832][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:27:14,556][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:27:15,281][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:27:16,006][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:27:16,731][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:27:17,456][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:27:18,180][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:27:18,907][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:27:19,632][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:27:20,357][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:27:21,083][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:27:21,810][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:27:22,535][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:27:23,262][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:27:23,987][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:27:24,713][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:27:25,438][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:27:26,162][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:27:26,886][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:27:27,611][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:27:28,337][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:27:29,063][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:27:29,788][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:27:30,513][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:27:31,239][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:27:31,965][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:27:32,692][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:27:33,419][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:27:34,146][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:27:34,874][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:27:35,601][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:27:36,329][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:27:37,056][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:27:37,784][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:27:38,511][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:27:39,239][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:27:39,966][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:27:40,694][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:27:41,420][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:27:42,382][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:27:43,108][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:27:43,835][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:27:44,560][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:27:45,286][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:27:46,012][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:27:46,739][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:27:47,467][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:27:48,194][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:27:48,921][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:27:49,647][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:27:50,373][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:27:51,099][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:27:51,824][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:27:52,551][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:27:53,278][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:27:54,003][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:27:54,744][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:27:55,806][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:27:55,810][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:27:55,812][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:27:57,183][__main__][INFO] - Iteration 316 took 56s (9.32% Gen, 88.23% Train). Generation: 5s, Training: 49s. Estimated remaining time: 10h 35m 14s. Estimated total time: 15h 38m 37s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 51s, 500 more iterations: 7h 49m 18s. [2026-03-25 19:27:57,186][__main__][INFO] - Starting iteration 316. [2026-03-25 19:27:57,190][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:27:57,190][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:28:05,734][__main__][INFO] - Number of regex retries in iteration 316: 0 [2026-03-25 19:28:05,736][__main__][INFO] - agents played in iteration 316 are Bob, Alice [2026-03-25 19:28:06,244][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:28:06,310][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:28:06,311][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:28:06,311][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:28:07,005][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:28:07,655][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:28:08,380][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:28:09,101][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:28:09,823][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:28:10,545][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:28:11,268][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:28:11,989][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:28:12,712][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:28:13,434][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:28:14,158][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:28:14,881][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:28:15,604][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:28:16,326][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:28:17,050][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:28:17,774][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:28:18,496][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:28:19,220][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:28:19,945][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:28:20,668][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:28:21,391][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:28:22,114][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:28:22,837][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:28:23,561][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:28:24,287][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:28:25,010][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:28:25,735][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:28:26,460][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:28:27,186][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:28:27,909][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:28:28,633][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:28:29,358][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:28:30,082][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:28:30,806][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:28:31,531][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:28:32,257][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:28:32,981][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:28:33,707][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:28:34,431][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:28:35,155][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:28:35,879][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:28:36,602][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:28:37,327][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:28:38,052][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:28:38,779][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:28:39,505][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:28:40,231][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:28:40,957][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:28:41,974][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:28:42,700][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:28:43,425][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:28:44,150][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:28:44,877][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:28:45,603][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:28:46,328][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:28:47,053][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:28:47,779][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:28:48,501][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:28:49,226][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:28:49,951][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:28:50,677][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:28:51,404][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:28:52,131][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:28:52,859][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:28:53,584][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:28:54,362][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:28:58,381][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:28:58,386][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:28:58,389][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:29:00,029][__main__][INFO] - Iteration 317 took 1m 2s (13.60% Gen, 83.79% Train). Generation: 8s, Training: 52s. Estimated remaining time: 12h 22m 55s. Estimated total time: 17h 27m 21s. Time estimates for 10 more iterations: 10m 28s, 100 more iterations: 1h 44m 44s, 500 more iterations: 8h 43m 40s. [2026-03-25 19:29:00,032][__main__][INFO] - Starting iteration 317. [2026-03-25 19:29:00,036][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:29:00,037][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:29:05,332][__main__][INFO] - Number of regex retries in iteration 317: 0 [2026-03-25 19:29:05,334][__main__][INFO] - agents played in iteration 317 are Bob, Alice [2026-03-25 19:29:05,835][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:29:05,900][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:29:05,901][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:29:05,902][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:29:06,614][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:29:07,266][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:29:07,989][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:29:08,712][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:29:09,435][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:29:10,156][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:29:10,876][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:29:11,598][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:29:12,320][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:29:13,043][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:29:13,764][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:29:14,487][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:29:15,208][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:29:15,931][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:29:16,654][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:29:17,376][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:29:18,097][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:29:18,819][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:29:19,543][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:29:20,265][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:29:20,987][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:29:21,709][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:29:22,432][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:29:23,156][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:29:23,879][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:29:24,603][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:29:25,328][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:29:26,050][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:29:26,773][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:29:27,497][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:29:28,220][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:29:28,944][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:29:29,670][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:29:30,394][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:29:31,119][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:29:31,843][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:29:32,566][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:29:33,291][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:29:34,014][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:29:34,738][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:29:35,461][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:29:36,185][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:29:36,912][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:29:37,637][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:29:38,362][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:29:39,088][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:29:39,813][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:29:40,538][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:29:41,499][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:29:42,224][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:29:42,947][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:29:43,673][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:29:44,398][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:29:45,124][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:29:45,849][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:29:46,576][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:29:47,302][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:29:48,027][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:29:48,751][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:29:49,474][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:29:50,199][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:29:50,925][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:29:51,650][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:29:52,377][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:29:53,103][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:29:53,835][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:29:55,008][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:29:55,012][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:29:55,014][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:29:56,972][__main__][INFO] - Iteration 318 took 56s (9.30% Gen, 87.25% Train). Generation: 5s, Training: 49s. Estimated remaining time: 10h 43m 35s. Estimated total time: 15h 48m 57s. Time estimates for 10 more iterations: 9m 29s, 100 more iterations: 1h 34m 53s, 500 more iterations: 7h 54m 28s. [2026-03-25 19:29:56,975][__main__][INFO] - Starting iteration 318. [2026-03-25 19:29:56,979][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:29:56,980][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:30:02,415][__main__][INFO] - Number of regex retries in iteration 318: 0 [2026-03-25 19:30:02,416][__main__][INFO] - agents played in iteration 318 are Bob, Alice [2026-03-25 19:30:02,923][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:30:02,989][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:30:02,990][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:30:02,991][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:30:03,703][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:30:04,354][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:30:05,079][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:30:05,802][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:30:06,524][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:30:07,246][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:30:07,968][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:30:08,691][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:30:09,414][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:30:10,138][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:30:10,860][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:30:11,583][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:30:12,305][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:30:13,029][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:30:13,753][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:30:14,477][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:30:15,202][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:30:15,927][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:30:16,650][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:30:17,375][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:30:18,096][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:30:18,821][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:30:19,545][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:30:20,268][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:30:20,993][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:30:21,718][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:30:22,443][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:30:23,168][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:30:23,893][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:30:24,618][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:30:25,341][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:30:26,064][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:30:26,788][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:30:27,512][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:30:28,235][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:30:28,961][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:30:29,687][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:30:30,413][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:30:31,139][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:30:31,864][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:30:32,590][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:30:33,316][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:30:34,041][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:30:34,768][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:30:35,493][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:30:36,219][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:30:36,945][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:30:37,672][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:30:38,633][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:30:39,360][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:30:40,083][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:30:40,810][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:30:41,535][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:30:42,262][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:30:42,986][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:30:43,713][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:30:44,439][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:30:45,164][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:30:45,890][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:30:46,615][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:30:47,340][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:30:48,066][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:30:48,793][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:30:49,519][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:30:50,246][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:30:50,984][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:30:52,096][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:30:52,100][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:30:52,102][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:30:53,522][__main__][INFO] - Iteration 319 took 56s (9.61% Gen, 87.87% Train). Generation: 5s, Training: 49s. Estimated remaining time: 10h 36m 5s. Estimated total time: 15h 42m 25s. Time estimates for 10 more iterations: 9m 25s, 100 more iterations: 1h 34m 14s, 500 more iterations: 7h 51m 12s. [2026-03-25 19:30:53,525][__main__][INFO] - Starting iteration 319. [2026-03-25 19:30:53,529][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:30:53,529][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:30:58,807][__main__][INFO] - Number of regex retries in iteration 319: 0 [2026-03-25 19:30:58,808][__main__][INFO] - agents played in iteration 319 are Bob, Alice [2026-03-25 19:30:59,300][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:30:59,368][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:30:59,369][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:30:59,370][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:31:00,056][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:31:00,708][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:31:01,433][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:31:02,155][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:31:02,879][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:31:03,603][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:31:04,325][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:31:05,048][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:31:05,772][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:31:06,496][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:31:07,222][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:31:07,944][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:31:08,667][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:31:09,392][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:31:10,118][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:31:10,844][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:31:11,569][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:31:12,294][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:31:13,020][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:31:13,743][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:31:14,467][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:31:15,192][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:31:15,915][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:31:16,639][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:31:17,365][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:31:18,090][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:31:18,814][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:31:19,539][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:31:20,265][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:31:20,988][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:31:21,713][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:31:22,436][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:31:23,160][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:31:23,886][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:31:24,609][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:31:25,335][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:31:26,063][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:31:26,787][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:31:27,512][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:31:28,238][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:31:28,963][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:31:29,690][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:31:30,416][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:31:31,142][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:31:31,869][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:31:32,595][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:31:33,322][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:31:34,048][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:31:35,065][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:31:35,792][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:31:36,514][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:31:37,241][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:31:37,966][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:31:38,692][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:31:39,417][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:31:40,143][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:31:40,869][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:31:41,596][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:31:42,322][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:31:43,048][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:31:43,774][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:31:44,501][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:31:45,227][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:31:45,954][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:31:46,681][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:31:47,445][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:31:50,481][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:31:50,485][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:31:50,487][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:31:51,844][__main__][INFO] - Iteration 320 took 58s (9.05% Gen, 88.62% Train). Generation: 5s, Training: 51s. Estimated remaining time: 11h 4m 40s. Estimated total time: 16h 11m 57s. Time estimates for 10 more iterations: 9m 43s, 100 more iterations: 1h 37m 11s, 500 more iterations: 8h 5m 58s. [2026-03-25 19:31:51,847][__main__][INFO] - Starting iteration 320. [2026-03-25 19:31:51,853][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:31:51,854][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:31:54,431][mllm.models.large_language_model_local][WARNING] - Response %A> did not match regex: (|), retry 1/1 [2026-03-25 19:31:57,131][__main__][INFO] - Number of regex retries in iteration 320: 1 [2026-03-25 19:31:57,133][__main__][INFO] - agents played in iteration 320 are Bob, Alice [2026-03-25 19:31:57,724][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:31:57,791][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:31:57,792][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:31:57,793][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2026-03-25 19:31:58,524][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:31:59,174][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:31:59,899][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:32:00,620][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:32:01,341][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:32:02,064][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:32:02,786][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:32:03,507][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:32:04,229][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:32:04,953][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:32:05,675][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 19:32:06,398][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:32:07,120][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:32:07,845][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:32:08,570][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:32:09,294][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:32:10,017][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:32:10,739][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:32:11,463][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:32:12,190][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:32:12,915][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 256 [2026-03-25 19:32:13,643][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:32:14,367][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:32:15,092][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:32:15,816][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:32:16,541][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:32:17,267][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:32:17,992][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:32:18,716][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:32:19,438][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:32:20,162][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:32:20,887][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 19:32:21,614][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:32:22,338][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:32:23,065][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:32:23,789][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:32:24,514][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:32:25,239][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:32:25,965][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:32:26,691][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:32:27,417][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 
19:32:28,142][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:32:28,868][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:32:29,593][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:32:30,318][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:32:31,043][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:32:31,768][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:32:32,491][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:32:33,464][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:32:34,189][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:32:34,914][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:32:35,642][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 19:32:36,368][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:32:37,092][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:32:37,817][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:32:38,544][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:32:39,270][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:32:39,996][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:32:40,720][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:32:41,446][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:32:42,173][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:32:42,899][mllm.training.trainer_common][INFO] - 
Processing mini-batch 244 of 256 [2026-03-25 19:32:43,626][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:32:44,351][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:32:45,078][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 19:32:45,820][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:32:46,894][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:32:46,898][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:32:46,899][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:32:48,302][__main__][INFO] - Iteration 321 took 56s (9.35% Gen, 88.16% Train). Generation: 5s, Training: 49s. Estimated remaining time: 10h 32m 38s. Estimated total time: 15h 40m 52s. Time estimates for 10 more iterations: 9m 24s, 100 more iterations: 1h 34m 5s, 500 more iterations: 7h 50m 26s. [2026-03-25 19:32:48,304][__main__][INFO] - Starting iteration 321. [2026-03-25 19:32:48,309][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:32:48,310][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:32:53,564][__main__][INFO] - Number of regex retries in iteration 321: 0 [2026-03-25 19:32:53,565][__main__][INFO] - agents played in iteration 321 are Bob, Alice [2026-03-25 19:32:54,120][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:32:54,187][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:32:54,188][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:32:54,188][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:32:54,890][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:32:55,543][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:32:56,267][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:32:56,989][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:32:57,712][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:32:58,436][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:32:59,159][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:32:59,883][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:33:00,607][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:33:01,330][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:33:02,052][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:33:02,776][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:33:03,499][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:33:04,223][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:33:04,948][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:33:05,672][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:33:06,394][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:33:07,119][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:33:07,842][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:33:08,567][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:33:09,293][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:33:10,018][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:33:10,743][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:33:11,467][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:33:12,191][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:33:12,915][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:33:13,639][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:33:14,364][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:33:15,088][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:33:15,813][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:33:16,541][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:33:17,266][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:33:17,994][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:33:18,719][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:33:19,446][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:33:20,170][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:33:20,894][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:33:21,621][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:33:22,347][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:33:23,072][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:33:23,800][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:33:24,525][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:33:25,253][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:33:25,980][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:33:26,707][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:33:27,433][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:33:28,159][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:33:28,884][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:33:29,852][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:33:30,577][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:33:31,301][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:33:32,025][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:33:32,750][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:33:33,476][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:33:34,201][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:33:34,927][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:33:35,652][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:33:36,378][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:33:37,105][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:33:37,830][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:33:38,555][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:33:39,282][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:33:40,008][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:33:40,734][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:33:41,461][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:33:42,202][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:33:43,306][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:33:43,311][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:33:43,312][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:33:44,742][__main__][INFO] - Iteration 322 took 56s (9.31% Gen, 88.15% Train). Generation: 5s, Training: 49s. Estimated remaining time: 10h 31m 24s. Estimated total time: 15h 40m 35s. Time estimates for 10 more iterations: 9m 24s, 100 more iterations: 1h 34m 3s, 500 more iterations: 7h 50m 17s. [2026-03-25 19:33:44,744][__main__][INFO] - Starting iteration 322. [2026-03-25 19:33:44,749][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:33:44,750][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:33:50,129][__main__][INFO] - Number of regex retries in iteration 322: 0 [2026-03-25 19:33:50,130][__main__][INFO] - agents played in iteration 322 are Bob, Alice [2026-03-25 19:33:50,648][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:33:50,713][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:33:50,714][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:33:50,714][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:33:51,403][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:33:52,054][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:33:52,780][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:33:53,501][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:33:54,225][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:33:54,948][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:33:55,671][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:33:56,392][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:33:57,115][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:33:57,839][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:33:58,563][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:33:59,288][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:34:00,011][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:34:00,735][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:34:01,458][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:34:02,180][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:34:02,905][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:34:03,628][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:34:04,354][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:34:05,077][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:34:05,801][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:34:06,528][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:34:07,252][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:34:07,980][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:34:08,706][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:34:09,430][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:34:10,155][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:34:10,880][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:34:11,604][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:34:12,327][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:34:13,051][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:34:16,357][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:34:17,081][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:34:17,804][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:34:18,526][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:34:19,251][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:34:19,975][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:34:20,701][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:34:21,429][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:34:22,154][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:34:22,877][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:34:23,600][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:34:24,324][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:34:25,048][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:34:25,773][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:34:26,498][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:34:27,222][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:34:27,946][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:34:28,923][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:34:29,647][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:34:30,370][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:34:31,095][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:34:31,821][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:34:32,547][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:34:33,273][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:34:33,997][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:34:34,721][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:34:35,445][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:34:36,169][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:34:36,894][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:34:37,619][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:34:38,344][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:34:39,070][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:34:39,794][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:34:40,521][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:34:41,327][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:49 [2026-03-25 19:34:42,479][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:34:42,806][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:34:42,808][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:34:48,930][__main__][INFO] - Iteration 323 took 1m 4s (8.38% Gen, 82.08% Train). Generation: 5s, Training: 52s. Estimated remaining time: 12h 39m 28s. Estimated total time: 17h 49m 43s. Time estimates for 10 more iterations: 10m 41s, 100 more iterations: 1h 46m 58s, 500 more iterations: 8h 54m 51s. [2026-03-25 19:34:48,935][__main__][INFO] - Starting iteration 323. [2026-03-25 19:34:48,941][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:34:48,942][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:34:54,272][__main__][INFO] - Number of regex retries in iteration 323: 0 [2026-03-25 19:34:54,273][__main__][INFO] - agents played in iteration 323 are Bob, Alice [2026-03-25 19:34:54,768][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:34:54,833][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:34:54,834][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:34:54,835][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:34:55,526][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:34:56,177][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:34:56,900][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:34:57,619][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:34:58,339][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:34:59,060][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:34:59,780][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:35:00,499][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:35:01,221][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:35:01,941][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:35:02,663][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:35:03,382][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:35:04,104][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:35:04,824][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:35:05,545][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:35:06,268][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:35:06,991][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:35:07,714][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:35:08,436][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:35:09,158][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:35:09,879][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:35:10,601][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:35:11,324][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:35:12,046][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:35:12,767][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:35:13,491][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:35:14,215][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:35:14,938][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:35:15,660][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:35:16,381][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:35:17,105][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:35:17,828][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:35:18,552][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:35:19,274][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:35:19,997][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:35:20,721][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:35:21,445][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:35:22,169][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:35:22,893][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:35:23,616][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:35:24,340][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:35:25,063][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:35:25,787][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:35:26,512][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:35:27,234][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:35:27,958][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:35:28,681][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:35:29,405][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:35:30,393][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:35:31,118][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:35:31,843][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:35:32,566][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:35:33,291][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:35:34,016][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:35:34,740][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:35:35,465][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:35:36,189][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:35:36,913][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:35:37,636][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:35:38,362][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:35:39,088][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:35:39,813][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:35:40,538][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:35:41,262][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:35:41,986][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:35:42,720][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:35:44,075][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:35:44,080][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:35:44,083][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:35:48,572][__main__][INFO] - Iteration 324 took 59s (8.94% Gen, 83.53% Train). Generation: 5s, Training: 49s. Estimated remaining time: 11h 22m 39s. Estimated total time: 16h 33m 53s. Time estimates for 10 more iterations: 9m 56s, 100 more iterations: 1h 39m 23s, 500 more iterations: 8h 16m 56s. [2026-03-25 19:35:48,575][__main__][INFO] - Starting iteration 324. [2026-03-25 19:35:48,580][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:35:48,580][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:35:55,445][__main__][INFO] - Number of regex retries in iteration 324: 0 [2026-03-25 19:35:55,446][__main__][INFO] - agents played in iteration 324 are Bob, Alice [2026-03-25 19:35:55,950][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:35:56,016][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:35:56,017][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:35:56,018][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:35:56,707][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:35:57,356][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:35:58,077][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:35:58,798][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:35:59,518][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:36:00,239][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:36:00,960][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:36:01,678][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:36:02,399][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:36:03,119][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:36:03,839][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:36:04,560][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:36:05,281][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:36:06,002][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:36:06,722][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:36:07,445][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:36:08,167][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:36:08,891][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:36:09,612][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:36:10,334][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:36:11,057][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:36:11,778][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:36:12,503][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:36:13,224][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:36:13,947][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:36:14,669][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:36:15,393][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:36:16,116][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:36:16,839][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:36:17,560][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:36:18,283][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:36:19,005][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:36:19,728][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:36:20,451][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:36:21,173][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:36:21,895][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:36:22,618][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:36:23,343][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:36:24,066][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:36:24,789][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:36:25,512][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:36:26,235][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:36:26,958][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:36:27,681][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:36:28,405][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:36:29,128][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:36:29,851][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:36:30,574][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:36:31,529][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:36:32,252][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:36:32,978][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:36:33,701][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:36:34,426][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:36:35,150][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:36:35,874][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:36:36,597][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:36:37,323][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:36:38,046][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:36:38,772][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:36:39,499][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:36:40,223][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:36:40,947][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:36:41,671][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:36:42,394][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:36:43,121][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:36:43,856][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:36:45,036][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:36:45,041][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:36:45,043][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:36:46,682][__main__][INFO] - Iteration 325 took 58s (11.82% Gen, 85.36% Train). Generation: 6s, Training: 49s. Estimated remaining time: 10h 56m 12s. Estimated total time: 16h 8m 24s. Time estimates for 10 more iterations: 9m 41s, 100 more iterations: 1h 36m 50s, 500 more iterations: 8h 4m 12s. [2026-03-25 19:36:46,685][__main__][INFO] - Starting iteration 325. [2026-03-25 19:36:46,689][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:36:46,690][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:36:51,878][__main__][INFO] - Number of regex retries in iteration 325: 0 [2026-03-25 19:36:51,879][__main__][INFO] - agents played in iteration 325 are Bob, Alice [2026-03-25 19:36:52,393][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:36:52,460][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:36:52,461][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:36:52,462][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:36:53,167][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:36:53,818][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:36:54,543][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:36:55,265][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:36:55,986][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:36:56,708][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:36:57,429][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:36:58,152][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:36:58,874][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:36:59,597][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:37:00,317][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:37:01,040][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:37:01,762][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:37:02,486][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:37:03,210][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:37:03,931][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:37:04,653][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:37:05,376][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:37:06,100][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:37:06,824][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:37:07,548][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:37:08,273][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:37:11,118][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:37:11,841][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:37:12,565][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:37:13,287][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:37:14,011][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:37:14,732][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:37:15,454][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:37:16,177][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:37:16,901][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:37:17,625][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:37:18,348][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:37:19,070][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:37:19,792][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:37:20,515][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:37:21,240][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:37:21,963][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:37:22,688][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:37:23,410][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:37:24,133][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:37:24,858][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:37:25,582][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:37:26,306][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:37:27,031][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:37:27,756][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:37:28,482][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:37:29,206][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:37:30,241][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:37:30,969][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:37:31,691][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:37:32,417][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:37:33,143][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:37:33,868][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:37:34,591][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:37:35,315][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:37:36,041][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:37:36,766][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:37:37,491][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:37:38,217][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:37:38,942][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:37:39,668][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:37:40,394][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:37:41,119][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:37:41,844][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:37:42,579][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:49 [2026-03-25 19:37:43,844][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:37:43,849][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:37:43,851][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:37:45,285][__main__][INFO] - Iteration 326 took 58s (8.86% Gen, 88.69% Train). Generation: 5s, Training: 51s. Estimated remaining time: 11h 3m 26s. Estimated total time: 16h 16m 37s. Time estimates for 10 more iterations: 9m 45s, 100 more iterations: 1h 37m 39s, 500 more iterations: 8h 8m 18s. [2026-03-25 19:37:45,288][__main__][INFO] - Starting iteration 326. [2026-03-25 19:37:45,294][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:37:45,295][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:37:51,023][__main__][INFO] - Number of regex retries in iteration 326: 0 [2026-03-25 19:37:51,024][__main__][INFO] - agents played in iteration 326 are Bob, Alice [2026-03-25 19:37:51,550][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:37:51,615][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:37:51,616][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:37:51,617][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:37:52,313][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:37:52,965][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:37:53,688][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:37:54,410][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:37:55,133][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:37:55,856][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:37:56,579][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:37:57,301][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:37:58,022][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:37:58,746][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:37:59,469][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:38:00,193][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:38:00,916][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:38:01,639][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:38:02,361][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:38:03,083][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:38:03,808][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:38:04,533][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:38:05,256][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:38:05,979][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:38:06,702][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:38:07,427][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:38:08,151][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:38:08,875][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:38:09,599][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:38:10,324][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:38:11,050][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:38:11,774][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:38:12,497][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:38:13,220][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:38:13,943][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:38:14,667][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:38:15,391][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:38:16,117][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:38:16,841][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:38:17,566][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:38:18,292][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:38:19,016][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:38:19,741][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:38:20,465][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:38:21,190][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:38:21,914][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:38:22,639][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:38:23,363][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:38:24,088][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:38:24,813][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:38:25,539][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:38:26,265][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:38:27,230][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:38:27,956][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:38:28,680][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:38:29,406][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:38:30,129][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:38:30,854][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:38:31,578][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:38:32,304][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:38:33,029][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:38:33,755][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:38:34,478][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:38:35,204][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:38:35,930][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:38:36,656][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:38:37,382][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:38:38,109][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:38:38,836][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:38:39,592][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:38:40,678][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:38:40,682][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:38:40,683][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:38:42,158][__main__][INFO] - Iteration 327 took 56s (10.07% Gen, 87.32% Train). Generation: 5s, Training: 49s. Estimated remaining time: 10h 33m 39s. Estimated total time: 15h 47m 47s. Time estimates for 10 more iterations: 9m 28s, 100 more iterations: 1h 34m 46s, 500 more iterations: 7h 53m 53s. [2026-03-25 19:38:42,161][__main__][INFO] - Starting iteration 327. [2026-03-25 19:38:42,165][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:38:42,166][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:38:47,449][__main__][INFO] - Number of regex retries in iteration 327: 0 [2026-03-25 19:38:47,450][__main__][INFO] - agents played in iteration 327 are Bob, Alice [2026-03-25 19:38:48,059][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:38:48,125][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:38:48,126][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:38:48,127][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:38:48,833][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:38:49,483][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:38:50,208][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:38:50,931][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:38:51,653][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:38:52,376][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:38:53,102][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:38:53,826][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:38:54,547][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:38:55,272][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:38:55,997][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:38:56,719][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:38:57,442][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:38:58,166][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:38:58,889][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:38:59,613][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:39:00,337][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:39:01,060][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:39:01,784][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:39:02,507][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:39:03,231][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:39:03,955][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:39:04,680][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:39:05,404][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:39:06,128][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:39:06,851][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:39:07,576][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:39:08,299][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:39:09,024][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:39:09,749][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:39:10,475][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:39:11,201][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:39:11,925][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:39:12,651][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:39:13,377][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:39:14,103][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:39:14,829][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:39:15,555][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:39:16,281][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:39:17,006][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:39:17,732][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:39:18,457][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:39:19,183][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:39:19,908][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:39:20,634][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:39:21,360][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:39:22,086][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:39:22,811][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:39:23,769][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:39:24,493][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:39:25,218][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:39:25,943][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:39:26,668][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:39:27,394][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:39:28,121][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:39:28,846][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:39:29,572][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:39:30,298][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:39:31,022][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:39:31,747][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:39:32,470][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:39:33,196][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:39:33,921][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:39:34,647][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:39:35,373][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:39:36,123][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:39:37,217][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:39:37,220][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:39:37,221][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:39:38,727][__main__][INFO] - Iteration 328 took 56s (9.34% Gen, 87.99% Train). Generation: 5s, Training: 49s. Estimated remaining time: 10h 27m 39s. Estimated total time: 15h 42m 43s. Time estimates for 10 more iterations: 9m 25s, 100 more iterations: 1h 34m 16s, 500 more iterations: 7h 51m 21s. [2026-03-25 19:39:38,729][__main__][INFO] - Starting iteration 328. [2026-03-25 19:39:38,733][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:39:38,733][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:39:45,941][__main__][INFO] - Number of regex retries in iteration 328: 0 [2026-03-25 19:39:45,942][__main__][INFO] - agents played in iteration 328 are Bob, Alice [2026-03-25 19:39:46,474][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:39:46,541][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:39:46,542][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:39:46,543][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:39:47,251][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:39:47,904][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:39:48,627][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:39:49,347][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:39:50,071][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:39:50,794][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:39:51,517][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:39:52,238][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:39:52,960][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:39:53,682][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:39:54,407][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:39:55,129][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:39:55,851][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:39:56,575][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:39:57,299][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:39:58,023][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:39:58,748][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:39:59,470][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:40:00,192][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:40:00,916][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:40:01,642][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:40:02,366][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:40:03,090][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:40:03,813][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:40:04,536][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:40:05,260][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:40:05,983][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:40:06,708][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:40:07,430][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:40:08,154][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:40:08,880][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:40:09,603][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:40:10,330][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:40:11,056][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:40:11,780][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:40:12,505][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:40:13,230][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:40:13,953][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:40:14,677][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:40:15,401][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:40:16,127][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:40:16,851][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:40:17,576][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:40:18,300][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:40:19,026][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:40:19,751][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:40:20,477][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:40:21,202][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:40:22,241][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:40:22,965][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:40:23,689][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:40:24,413][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:40:25,138][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:40:25,864][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:40:26,590][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:40:27,315][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:40:28,041][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:40:28,767][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:40:29,493][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:40:30,218][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:40:30,945][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:40:31,671][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:40:32,396][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:40:33,123][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:40:33,850][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:40:34,609][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:40:35,739][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:40:35,743][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:40:35,745][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:40:37,126][__main__][INFO] - Iteration 329 took 58s (12.34% Gen, 85.29% Train). Generation: 7s, Training: 49s. Estimated remaining time: 10h 57m 12s. Estimated total time: 16h 13m 14s. Time estimates for 10 more iterations: 9m 43s, 100 more iterations: 1h 37m 19s, 500 more iterations: 8h 6m 37s. [2026-03-25 19:40:37,129][__main__][INFO] - Starting iteration 329. [2026-03-25 19:40:37,134][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:40:37,134][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:40:42,419][__main__][INFO] - Number of regex retries in iteration 329: 0 [2026-03-25 19:40:42,420][__main__][INFO] - agents played in iteration 329 are Bob, Alice [2026-03-25 19:40:42,924][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:40:42,990][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:40:42,991][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:40:42,992][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:40:43,682][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:40:44,336][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:40:45,060][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:40:45,782][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:40:46,503][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:40:47,226][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:40:47,951][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:40:48,673][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:40:49,396][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:40:50,117][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:40:50,840][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:40:51,564][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:40:52,288][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:40:53,011][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:40:53,734][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:40:54,456][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:40:55,179][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:40:55,903][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:40:56,628][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:40:57,350][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:40:58,073][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:40:58,798][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:40:59,521][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:41:00,246][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:41:00,971][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:41:01,694][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:41:02,417][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:41:03,141][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:41:03,865][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:41:04,590][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:41:05,314][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:41:06,038][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:41:06,763][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:41:07,486][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:41:08,210][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:41:08,938][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:41:09,662][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:41:10,387][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:41:11,111][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:41:11,834][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:41:12,560][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:41:13,285][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:41:14,011][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:41:14,736][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:41:15,459][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:41:16,185][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:41:16,908][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:41:17,633][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:41:18,598][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:41:19,326][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:41:20,051][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:41:20,779][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:41:21,503][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:41:22,230][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:41:22,957][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:41:23,682][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:41:24,408][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:41:25,135][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:41:25,862][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:41:26,588][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:41:27,315][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:41:28,042][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:41:28,768][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:41:29,495][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:41:30,221][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:41:30,967][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:41:32,056][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:41:32,060][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:41:32,061][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:41:33,477][__main__][INFO] - Iteration 330 took 56s (9.38% Gen, 88.10% Train). Generation: 5s, Training: 49s. Estimated remaining time: 10h 22m 6s. Estimated total time: 15h 39m 5s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 54s, 500 more iterations: 7h 49m 32s. [2026-03-25 19:41:33,480][__main__][INFO] - Starting iteration 330. [2026-03-25 19:41:33,484][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:41:33,485][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:41:38,661][__main__][INFO] - Number of regex retries in iteration 330: 0 [2026-03-25 19:41:38,662][__main__][INFO] - agents played in iteration 330 are Bob, Alice [2026-03-25 19:41:39,158][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:41:39,224][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:41:39,225][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:41:39,226][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:41:39,925][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:41:40,578][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:41:41,301][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:41:42,024][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:41:42,745][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:41:43,468][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:41:44,190][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:41:44,914][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:41:45,637][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:41:46,361][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:41:47,085][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:41:47,808][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:41:48,531][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:41:49,254][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:41:49,978][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:41:50,702][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:41:51,427][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:41:52,150][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:41:52,874][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:41:53,597][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:41:54,321][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:41:55,046][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:41:55,769][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:41:56,494][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:41:57,219][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:41:57,944][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:41:58,669][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:41:59,395][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:42:00,119][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:42:00,845][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:42:01,568][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:42:02,295][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:42:03,019][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:42:03,745][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:42:04,469][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:42:05,197][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:42:05,922][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:42:06,648][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:42:07,375][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:42:08,101][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:42:08,829][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:42:09,556][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:42:10,284][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:42:11,011][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:42:11,738][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:42:12,465][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:42:13,192][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:42:13,920][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:42:14,880][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:42:15,608][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:42:16,335][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:42:17,063][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:42:17,789][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:42:18,515][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:42:19,244][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:42:19,972][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:42:20,699][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:42:21,425][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:42:22,152][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:42:22,880][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:42:23,607][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:42:24,333][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:42:25,061][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:42:25,787][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:42:26,514][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:42:27,259][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:42:28,489][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:42:28,493][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:42:28,494][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:42:29,871][__main__][INFO] - Iteration 331 took 56s (9.18% Gen, 88.37% Train). Generation: 5s, Training: 49s. Estimated remaining time: 10h 21m 53s. Estimated total time: 15h 39m 48s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 58s, 500 more iterations: 7h 49m 54s. [2026-03-25 19:42:29,874][__main__][INFO] - Starting iteration 331. [2026-03-25 19:42:29,878][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:42:29,878][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:42:35,276][__main__][INFO] - Number of regex retries in iteration 331: 0 [2026-03-25 19:42:35,277][__main__][INFO] - agents played in iteration 331 are Bob, Alice [2026-03-25 19:42:35,780][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:42:35,848][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:42:35,849][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:42:35,849][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:42:36,542][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:42:37,193][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:42:37,919][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:42:38,640][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:42:39,364][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:42:40,088][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:42:40,810][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:42:41,532][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:42:42,255][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:42:42,978][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:42:43,701][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:42:44,425][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:42:45,148][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:42:45,869][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:42:46,593][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:42:47,317][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:42:48,041][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:42:48,767][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:42:49,494][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:42:50,218][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:42:50,942][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:42:51,666][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:42:52,388][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:42:53,113][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:42:53,836][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:42:54,561][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:42:55,285][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:42:56,010][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:42:56,736][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:42:57,460][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:42:58,185][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:42:58,909][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:42:59,634][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:43:00,358][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:43:01,081][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:43:01,806][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:43:02,529][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:43:03,254][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:43:03,979][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:43:04,704][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:43:05,430][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:43:06,155][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:43:06,881][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:43:07,606][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:43:08,329][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:43:09,056][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:43:09,779][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:43:10,505][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:43:11,536][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:43:12,260][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:43:12,986][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:43:13,710][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:43:14,436][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:43:15,161][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:43:15,886][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:43:16,612][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:43:17,336][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:43:18,061][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:43:18,787][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:43:19,513][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:43:20,238][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:43:20,964][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:43:21,690][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:43:22,417][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:43:23,143][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:43:23,902][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:43:25,285][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:43:25,290][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:43:25,293][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:43:26,926][__main__][INFO] - Iteration 332 took 57s (9.46% Gen, 87.67% Train). Generation: 5s, Training: 50s. Estimated remaining time: 10h 31m 58s. Estimated total time: 15h 50m 50s. Time estimates for 10 more iterations: 9m 30s, 100 more iterations: 1h 35m 5s, 500 more iterations: 7h 55m 25s. [2026-03-25 19:43:26,929][__main__][INFO] - Starting iteration 332. [2026-03-25 19:43:26,933][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:43:26,934][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:43:32,573][__main__][INFO] - Number of regex retries in iteration 332: 0 [2026-03-25 19:43:32,574][__main__][INFO] - agents played in iteration 332 are Bob, Alice [2026-03-25 19:43:33,076][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:43:33,144][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:43:33,145][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:43:33,146][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:43:33,850][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:43:34,500][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:43:35,226][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:43:35,947][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:43:36,670][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:43:37,394][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:43:38,117][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:43:38,841][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:43:39,564][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:43:40,285][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:43:41,009][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:43:41,733][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:43:42,456][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:43:43,180][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:43:43,905][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:43:44,628][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:43:45,351][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:43:46,074][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:43:46,797][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:43:47,521][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:43:48,247][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:43:48,970][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:43:49,694][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:43:50,419][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:43:51,143][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:43:51,867][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:43:52,591][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:43:53,315][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:43:54,040][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:43:54,763][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:43:55,489][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:43:56,215][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:43:56,939][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:43:57,665][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:43:58,391][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:43:59,116][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:43:59,842][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:44:00,566][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:44:01,291][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:44:02,015][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:44:02,739][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:44:03,463][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:44:04,188][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:44:04,913][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:44:05,639][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:44:06,363][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:44:07,089][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:44:07,815][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:44:08,775][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:44:09,500][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:44:10,224][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:44:10,949][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:44:11,673][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:44:12,398][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:44:13,124][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:44:13,849][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:44:14,575][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:44:15,301][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:44:16,026][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:44:16,753][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:44:17,480][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:44:18,206][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:44:18,934][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:44:19,662][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:44:20,387][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:44:21,174][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:44:22,360][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:44:22,365][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:44:22,367][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:44:23,791][__main__][INFO] - Iteration 333 took 56s (9.92% Gen, 87.57% Train). Generation: 5s, Training: 49s. Estimated remaining time: 10h 27m 50s. Estimated total time: 15h 47m 40s. Time estimates for 10 more iterations: 9m 28s, 100 more iterations: 1h 34m 46s, 500 more iterations: 7h 53m 50s. [2026-03-25 19:44:23,794][__main__][INFO] - Starting iteration 333. [2026-03-25 19:44:23,799][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:44:23,799][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:44:29,074][__main__][INFO] - Number of regex retries in iteration 333: 0 [2026-03-25 19:44:29,076][__main__][INFO] - agents played in iteration 333 are Bob, Alice [2026-03-25 19:44:29,574][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:44:29,639][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:44:29,640][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:44:29,641][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:44:30,334][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:44:30,987][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:44:31,713][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:44:32,437][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:44:33,159][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:44:33,882][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:44:34,607][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:44:35,332][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:44:36,056][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:44:36,780][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:44:37,504][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:44:38,227][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:44:38,954][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:44:39,678][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:44:40,402][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:44:41,126][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:44:41,847][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:44:42,572][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:44:43,295][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:44:44,020][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:44:44,746][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:44:45,470][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:44:46,197][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:44:46,922][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:44:47,648][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:44:48,373][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:44:49,098][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:44:49,824][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:44:50,548][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:44:51,273][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:44:51,997][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:44:52,723][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:44:53,448][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:44:54,172][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:44:54,897][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:44:55,622][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:44:56,347][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:44:57,073][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:44:57,798][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:44:58,525][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:44:59,250][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:44:59,976][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:45:00,703][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:45:01,430][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:45:02,155][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:45:02,882][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:45:03,609][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:45:04,333][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:45:05,291][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:45:06,018][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:45:06,742][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:45:07,470][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:45:08,194][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:45:08,919][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:45:09,646][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:45:10,371][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:45:11,098][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:45:11,823][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:45:12,549][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:45:13,275][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:45:14,001][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:45:14,728][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:45:15,454][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:45:16,180][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:45:16,906][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:45:17,648][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:45:18,889][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:45:18,893][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:45:18,895][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:45:20,306][__main__][INFO] - Iteration 334 took 56s (9.34% Gen, 88.16% Train). Generation: 5s, Training: 49s. Estimated remaining time: 10h 21m 3s. Estimated total time: 15h 41m 49s. Time estimates for 10 more iterations: 9m 25s, 100 more iterations: 1h 34m 10s, 500 more iterations: 7h 50m 54s. [2026-03-25 19:45:20,308][__main__][INFO] - Starting iteration 334. [2026-03-25 19:45:20,313][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:45:20,314][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:45:25,451][__main__][INFO] - Number of regex retries in iteration 334: 0 [2026-03-25 19:45:25,452][__main__][INFO] - agents played in iteration 334 are Bob, Alice [2026-03-25 19:45:26,029][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:45:26,096][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:45:26,097][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:45:26,098][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:45:26,804][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:45:27,455][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:45:28,181][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:45:28,905][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:45:29,628][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:45:30,351][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:45:31,073][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:45:31,796][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:45:32,521][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:45:33,244][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:45:33,967][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:45:34,690][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:45:35,415][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:45:36,138][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:45:36,863][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:45:37,586][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:45:38,311][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:45:39,035][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:45:39,758][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:45:40,483][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:45:41,206][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:45:41,930][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:45:42,654][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:45:43,380][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:45:44,105][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:45:44,831][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:45:45,555][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:45:46,280][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:45:47,004][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:45:47,729][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:45:48,457][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:45:49,180][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:45:49,905][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:45:50,630][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:45:51,355][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:45:52,082][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:45:52,807][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:45:53,533][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:45:54,260][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:45:54,985][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:45:55,710][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:45:56,436][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:45:57,162][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:45:57,887][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:45:58,613][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:45:59,339][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:46:00,065][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:46:00,790][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:46:01,828][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:46:02,553][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:46:03,278][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:46:04,004][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:46:04,729][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:46:05,455][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:46:06,181][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:46:06,907][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:46:07,633][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:46:08,359][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:46:09,087][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:46:09,814][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:46:10,540][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:46:11,267][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:46:11,992][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:46:12,718][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:46:13,446][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:46:14,202][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:46:15,281][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:46:15,882][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:46:15,885][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:46:18,361][__main__][INFO] - Iteration 335 took 58s (8.85% Gen, 86.88% Train). Generation: 5s, Training: 50s. Estimated remaining time: 10h 45m 47s. Estimated total time: 16h 7m 31s. Time estimates for 10 more iterations: 9m 40s, 100 more iterations: 1h 36m 45s, 500 more iterations: 8h 3m 45s. [2026-03-25 19:46:18,369][__main__][INFO] - Starting iteration 335. [2026-03-25 19:46:18,376][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:46:18,376][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:46:27,172][__main__][INFO] - Number of regex retries in iteration 335: 0 [2026-03-25 19:46:27,173][__main__][INFO] - agents played in iteration 335 are Bob, Alice [2026-03-25 19:46:27,729][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:46:27,796][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:46:27,796][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:46:27,797][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:46:28,510][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:46:29,161][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:46:29,883][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:46:30,603][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:46:31,323][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:46:32,043][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:46:32,763][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:46:33,486][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:46:34,205][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:46:34,927][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:46:35,649][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:46:36,369][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:46:37,090][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:46:37,813][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:46:38,535][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:46:39,256][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:46:39,978][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:46:40,701][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:46:41,421][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:46:42,142][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:46:42,864][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:46:43,586][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:46:44,308][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:46:45,028][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:46:45,752][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:46:46,476][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:46:47,197][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:46:47,919][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:46:48,643][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:46:49,365][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:46:50,087][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:46:50,810][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:46:51,534][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:46:52,256][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:46:52,979][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:46:53,702][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:46:54,426][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:46:55,150][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:46:55,872][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:46:56,595][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:46:57,318][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:46:58,041][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:46:58,765][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:46:59,490][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:47:00,214][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:47:00,938][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:47:01,660][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:47:02,383][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:47:03,340][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:47:04,065][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:47:04,789][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:47:05,512][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:47:06,236][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:47:06,959][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:47:07,683][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:47:08,407][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:47:09,133][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:47:09,859][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:47:10,585][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:47:11,309][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:47:12,035][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:47:12,760][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:47:13,484][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:47:14,209][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:47:14,933][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:47:15,681][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:47:18,101][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:47:18,106][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:47:18,108][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:47:20,153][__main__][INFO] - Iteration 336 took 1m 1s (14.24% Gen, 82.45% Train). Generation: 8s, Training: 50s. Estimated remaining time: 11h 46m 53s. Estimated total time: 17h 9m 39s. Time estimates for 10 more iterations: 10m 17s, 100 more iterations: 1h 42m 57s, 500 more iterations: 8h 34m 49s. [2026-03-25 19:47:20,157][__main__][INFO] - Starting iteration 336. [2026-03-25 19:47:20,165][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:47:20,166][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:47:25,428][__main__][INFO] - Number of regex retries in iteration 336: 0 [2026-03-25 19:47:25,430][__main__][INFO] - agents played in iteration 336 are Bob, Alice [2026-03-25 19:47:25,928][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:47:25,994][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:47:25,995][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:47:25,995][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:47:26,699][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:47:27,350][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:47:28,074][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:47:28,794][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:47:29,516][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:47:30,239][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:47:30,962][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:47:31,683][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:47:32,405][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:47:33,126][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:47:33,849][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:47:34,570][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:47:35,293][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:47:36,014][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:47:36,737][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:47:37,459][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:47:38,183][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:47:38,907][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:47:39,630][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:47:40,352][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:47:41,075][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:47:41,798][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:47:42,521][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:47:43,244][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:47:43,968][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:47:44,691][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:47:45,415][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:47:46,139][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:47:46,862][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:47:47,586][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:47:48,310][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:47:49,033][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:47:49,756][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:47:50,482][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:47:51,205][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:47:51,931][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:47:52,656][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:47:53,380][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:47:54,106][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:47:54,830][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:47:55,903][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:47:56,634][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:47:57,361][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:47:58,086][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:47:58,812][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:47:59,537][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:48:00,266][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:48:00,992][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:48:01,987][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:48:02,715][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:48:03,440][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:48:04,166][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:48:04,892][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:48:05,617][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:48:06,342][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:48:07,067][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:48:07,792][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:48:08,516][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:48:09,241][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:48:09,967][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:48:10,691][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:48:11,418][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:48:12,140][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:48:12,867][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:48:13,594][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:48:14,441][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:48:15,565][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:48:15,568][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:48:15,570][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:48:16,974][__main__][INFO] - Iteration 337 took 56s (9.26% Gen, 88.25% Train). Generation: 5s, Training: 50s. Estimated remaining time: 10h 23m 11s. Estimated total time: 15h 46m 53s. Time estimates for 10 more iterations: 9m 28s, 100 more iterations: 1h 34m 41s, 500 more iterations: 7h 53m 26s. [2026-03-25 19:48:16,986][__main__][INFO] - Starting iteration 337. [2026-03-25 19:48:17,006][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:48:17,006][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:48:22,498][__main__][INFO] - Number of regex retries in iteration 337: 0 [2026-03-25 19:48:22,499][__main__][INFO] - agents played in iteration 337 are Bob, Alice [2026-03-25 19:48:23,031][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:48:23,097][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:48:23,098][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:48:23,098][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:48:23,827][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:48:25,392][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:48:26,115][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:48:26,839][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:48:27,564][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:48:28,288][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:48:29,010][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:48:29,731][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:48:30,453][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:48:31,177][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:48:31,900][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:48:32,623][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:48:33,347][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:48:34,068][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:48:34,793][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:48:35,515][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:48:36,242][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:48:36,967][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:48:37,692][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:48:38,417][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:48:39,144][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:48:39,869][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:48:40,595][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:48:41,320][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:48:42,047][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:48:42,772][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:48:43,499][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:48:44,226][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:48:44,952][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:48:45,679][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:48:46,407][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:48:47,131][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:48:47,856][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:48:48,579][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:48:49,305][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:48:50,028][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:48:50,754][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:48:51,477][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:48:52,203][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:48:52,928][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:48:53,655][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:48:54,379][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:48:55,103][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:48:55,827][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:48:56,552][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:48:57,276][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:48:58,001][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:48:58,726][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:48:59,761][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:49:00,488][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:49:01,212][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:49:01,938][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:49:02,664][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:49:03,389][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:49:04,114][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:49:04,838][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:49:05,565][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:49:06,290][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:49:07,017][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:49:07,742][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:49:08,468][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:49:09,195][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:49:09,921][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:49:10,649][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:49:11,374][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:49:12,131][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:48 [2026-03-25 19:49:13,221][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:49:13,225][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:49:13,227][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:49:14,537][__main__][INFO] - Iteration 338 took 57s (9.55% Gen, 88.17% Train). Generation: 5s, Training: 50s. Estimated remaining time: 10h 34m 14s. Estimated total time: 15h 58m 55s. Time estimates for 10 more iterations: 9m 35s, 100 more iterations: 1h 35m 53s, 500 more iterations: 7h 59m 27s. [2026-03-25 19:49:14,539][__main__][INFO] - Starting iteration 338. [2026-03-25 19:49:14,543][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:49:14,543][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:49:19,887][__main__][INFO] - Number of regex retries in iteration 338: 0 [2026-03-25 19:49:19,888][__main__][INFO] - agents played in iteration 338 are Bob, Alice [2026-03-25 19:49:20,391][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:49:20,456][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:49:20,457][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:49:20,458][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:49:21,151][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:49:21,802][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:49:22,528][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:49:23,252][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:49:23,973][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:49:24,697][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:49:25,420][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:49:26,145][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:49:26,868][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:49:27,594][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:49:28,316][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:49:29,040][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:49:29,764][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:49:30,486][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:49:31,210][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:49:31,934][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:49:32,658][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:49:33,384][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:49:34,109][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:49:34,833][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:49:35,556][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:49:36,281][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:49:37,004][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:49:37,729][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:49:38,453][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:49:39,180][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:49:39,904][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:49:40,629][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:49:41,353][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:49:42,078][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:49:42,803][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:49:43,530][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:49:44,256][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:49:44,980][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:49:45,706][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:49:46,431][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:49:47,157][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:49:47,882][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:49:48,607][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:49:49,332][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:49:50,056][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:49:50,781][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:49:51,507][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:49:52,232][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:49:52,958][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:49:53,684][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:49:54,410][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:49:55,135][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:49:56,086][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:49:56,813][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:49:57,539][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:49:58,265][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:49:58,991][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:49:59,717][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:50:00,444][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:50:01,170][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:50:01,896][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:50:02,622][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:50:03,349][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:50:04,076][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:50:04,802][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:50:05,528][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:50:06,254][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:50:06,980][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:50:07,707][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:50:08,449][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:50:09,630][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:50:09,633][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:50:09,635][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:50:11,039][__main__][INFO] - Iteration 339 took 56s (9.46% Gen, 88.05% Train). Generation: 5s, Training: 49s. Estimated remaining time: 10h 16m 1s. Estimated total time: 15h 41m 38s. Time estimates for 10 more iterations: 9m 24s, 100 more iterations: 1h 34m 9s, 500 more iterations: 7h 50m 49s. [2026-03-25 19:50:11,042][__main__][INFO] - Starting iteration 339. [2026-03-25 19:50:11,047][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:50:11,048][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:50:16,380][__main__][INFO] - Number of regex retries in iteration 339: 0 [2026-03-25 19:50:16,381][__main__][INFO] - agents played in iteration 339 are Bob, Alice [2026-03-25 19:50:16,878][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:50:16,942][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:50:16,943][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:50:16,943][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:50:17,640][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:50:18,294][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:50:19,017][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:50:19,740][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:50:20,462][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:50:21,185][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:50:21,910][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:50:22,634][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:50:23,359][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:50:24,082][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:50:24,806][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:50:25,529][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:50:26,254][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:50:26,977][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:50:27,700][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:50:28,425][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:50:29,151][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:50:29,875][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:50:30,600][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:50:31,325][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:50:32,049][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:50:32,772][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:50:33,496][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:50:34,220][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:50:34,944][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:50:35,668][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:50:36,394][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:50:37,118][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:50:37,845][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:50:38,570][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:50:39,295][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:50:40,021][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:50:40,746][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:50:41,471][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:50:42,198][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:50:42,923][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:50:43,648][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:50:44,373][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:50:45,097][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:50:45,824][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:50:46,549][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:50:47,277][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:50:48,003][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:50:48,729][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:50:49,456][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:50:50,181][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:50:50,906][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:50:51,633][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:50:52,596][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:50:53,322][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:50:54,049][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:50:54,775][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:50:55,500][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:50:56,227][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:50:56,954][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:50:57,680][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:50:58,408][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:50:59,134][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:50:59,860][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:51:00,587][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:51:01,315][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:51:02,042][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:51:02,768][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:51:03,493][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:51:04,222][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:51:04,960][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:51:06,143][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:51:06,147][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:51:06,149][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:51:07,725][__main__][INFO] - Iteration 340 took 56s (9.41% Gen, 87.80% Train). Generation: 5s, Training: 49s. Estimated remaining time: 10h 18m 7s. Estimated total time: 15h 44m 41s. Time estimates for 10 more iterations: 9m 26s, 100 more iterations: 1h 34m 28s, 500 more iterations: 7h 52m 20s. [2026-03-25 19:51:07,728][__main__][INFO] - Starting iteration 340. [2026-03-25 19:51:07,732][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:51:07,732][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:51:13,019][__main__][INFO] - Number of regex retries in iteration 340: 0 [2026-03-25 19:51:13,020][__main__][INFO] - agents played in iteration 340 are Bob, Alice [2026-03-25 19:51:13,596][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:51:13,662][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:51:13,663][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:51:13,663][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:51:14,357][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:51:15,011][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:51:15,734][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:51:16,458][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:51:17,182][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:51:17,906][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:51:18,631][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:51:19,355][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:51:20,079][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:51:20,803][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:51:21,529][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:51:22,252][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:51:22,975][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:51:23,699][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:51:24,422][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:51:25,146][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:51:25,871][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:51:26,595][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:51:27,319][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:51:28,041][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:51:28,766][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:51:29,489][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:51:30,215][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:51:30,938][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:51:31,664][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:51:32,390][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:51:33,115][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:51:33,842][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:51:34,569][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:51:35,294][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:51:36,019][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:51:36,745][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:51:37,471][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:51:38,195][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:51:38,922][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:51:39,647][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:51:40,374][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:51:41,099][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:51:41,823][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:51:42,548][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:51:43,273][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:51:43,998][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:51:44,724][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:51:45,449][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:51:46,175][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:51:46,901][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:51:47,627][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:51:48,352][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:51:49,391][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:51:50,116][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:51:50,842][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:51:51,568][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:51:52,292][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:51:53,018][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:51:53,745][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:51:54,469][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:51:55,195][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:51:55,920][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:51:56,646][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:51:57,372][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:51:58,099][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:51:58,825][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:51:59,552][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:52:00,280][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:52:01,006][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:52:01,774][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:52:02,983][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:52:02,988][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:52:02,991][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:52:04,413][__main__][INFO] - Iteration 341 took 56s (9.33% Gen, 88.16% Train). Generation: 5s, Training: 49s. Estimated remaining time: 10h 17m 12s. Estimated total time: 15h 44m 42s. Time estimates for 10 more iterations: 9m 26s, 100 more iterations: 1h 34m 28s, 500 more iterations: 7h 52m 21s. [2026-03-25 19:52:04,417][__main__][INFO] - Starting iteration 341. [2026-03-25 19:52:04,424][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:52:04,425][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:52:15,611][__main__][INFO] - Number of regex retries in iteration 341: 0 [2026-03-25 19:52:15,612][__main__][INFO] - agents played in iteration 341 are Bob, Alice [2026-03-25 19:52:16,152][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:52:16,218][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:52:16,219][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:52:16,220][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:52:16,918][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:52:17,568][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:52:18,292][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:52:19,011][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:52:19,733][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:52:20,453][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:52:21,174][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:52:21,897][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:52:22,621][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:52:23,342][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:52:24,065][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:52:24,786][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:52:25,508][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:52:26,230][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:52:26,951][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:52:27,675][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:52:28,396][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:52:29,117][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:52:29,842][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:52:30,564][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:52:31,288][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:52:32,010][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:52:32,732][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:52:33,455][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:52:34,179][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:52:34,901][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:52:35,627][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:52:36,350][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:52:37,075][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:52:37,796][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:52:38,519][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:52:39,244][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:52:39,967][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:52:40,690][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:52:41,414][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:52:42,137][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:52:42,859][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:52:43,583][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:52:44,307][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:52:45,030][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:52:45,755][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:52:46,480][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:52:52,736][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:52:53,458][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:52:54,179][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:52:54,899][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:52:55,621][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:52:56,342][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:52:57,304][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:52:58,026][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:52:58,748][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:52:59,471][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:53:00,193][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:53:00,917][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:53:01,640][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:53:02,361][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:53:03,085][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:53:03,810][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:53:04,532][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:53:05,255][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:53:05,980][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:53:06,703][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:53:07,427][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:53:08,152][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:53:08,877][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:53:09,626][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:52 [2026-03-25 19:53:10,766][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:53:10,769][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:53:10,771][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:53:12,260][__main__][INFO] - Iteration 342 took 1m 7s (16.49% Gen, 81.31% Train). Generation: 11s, Training: 55s. Estimated remaining time: 13h 22m 0s. Estimated total time: 18h 50m 38s. Time estimates for 10 more iterations: 11m 18s, 100 more iterations: 1h 53m 3s, 500 more iterations: 9h 25m 19s. [2026-03-25 19:53:12,263][__main__][INFO] - Starting iteration 342. [2026-03-25 19:53:12,267][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:53:12,268][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:53:17,533][__main__][INFO] - Number of regex retries in iteration 342: 0 [2026-03-25 19:53:17,537][__main__][INFO] - agents played in iteration 342 are Bob, Alice [2026-03-25 19:53:18,053][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:53:18,119][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:53:18,121][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:53:18,122][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:53:18,824][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:53:19,474][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:53:20,200][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:53:20,920][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:53:21,640][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:53:22,362][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:53:23,085][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:53:23,806][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:53:24,530][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:53:25,251][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:53:25,972][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:53:26,694][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:53:27,416][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:53:28,139][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:53:28,863][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:53:29,585][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:53:30,306][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:53:31,029][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:53:31,751][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:53:32,473][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:53:33,196][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:53:33,919][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:53:34,642][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:53:35,364][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:53:36,087][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:53:36,810][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:53:37,534][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:53:38,256][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:53:38,978][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:53:39,703][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:53:40,425][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:53:41,148][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:53:41,872][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:53:42,597][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:53:43,319][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:53:44,042][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:53:44,764][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:53:45,490][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:53:46,212][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:53:46,936][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:53:47,661][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:53:48,386][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:53:49,110][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:53:49,833][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:53:50,557][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:53:51,280][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:53:52,005][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:53:52,728][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:53:53,690][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:53:54,415][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:53:55,138][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:53:55,862][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:53:56,586][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:53:57,309][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:53:58,036][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:53:58,764][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:53:59,489][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:54:00,213][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:54:00,939][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:54:01,663][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:54:02,389][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:54:03,114][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:54:03,841][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:54:04,567][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:54:05,292][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:54:06,039][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:54:07,456][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:54:07,461][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:54:07,463][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:54:08,890][__main__][INFO] - Iteration 343 took 56s (9.30% Gen, 88.17% Train). Generation: 5s, Training: 49s. Estimated remaining time: 10h 14m 10s. Estimated total time: 15h 43m 45s. Time estimates for 10 more iterations: 9m 26s, 100 more iterations: 1h 34m 22s, 500 more iterations: 7h 51m 52s. [2026-03-25 19:54:08,893][__main__][INFO] - Starting iteration 343. [2026-03-25 19:54:08,897][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:54:08,898][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:54:14,837][__main__][INFO] - Number of regex retries in iteration 343: 0 [2026-03-25 19:54:14,838][__main__][INFO] - agents played in iteration 343 are Bob, Alice [2026-03-25 19:54:15,346][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:54:15,410][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:54:15,411][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:54:15,411][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:54:16,106][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:54:16,757][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:54:17,479][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:54:18,201][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:54:18,922][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:54:19,643][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:54:20,366][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:54:21,089][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:54:21,812][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:54:22,534][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:54:23,258][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:54:23,982][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:54:24,705][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:54:25,430][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:54:26,154][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:54:26,876][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:54:27,600][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:54:28,326][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:54:29,049][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:54:29,772][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:54:30,495][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:54:31,218][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:54:31,940][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:54:32,665][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:54:33,388][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:54:34,111][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:54:34,836][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:54:35,558][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:54:36,281][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:54:37,005][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:54:37,729][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:54:38,454][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:54:39,176][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:54:39,900][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:54:40,624][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:54:41,348][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:54:42,072][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:54:42,796][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:54:43,520][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:54:44,245][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:54:44,970][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:54:45,693][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:54:46,416][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:54:47,140][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:54:47,864][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:54:48,590][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:54:49,316][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:54:50,043][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:54:51,075][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:54:51,801][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:54:52,526][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:54:53,252][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:54:53,977][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:54:54,703][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:54:55,428][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:54:56,153][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:54:56,878][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:54:57,602][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:54:58,328][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:54:59,052][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:54:59,777][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:55:00,501][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:55:01,226][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:55:01,951][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:55:02,676][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:55:03,444][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:55:04,676][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:55:05,425][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:55:05,429][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:55:07,935][__main__][INFO] - Iteration 344 took 59s (10.06% Gen, 85.69% Train). Generation: 5s, Training: 50s. Estimated remaining time: 10h 53m 26s. Estimated total time: 16h 24m 0s. Time estimates for 10 more iterations: 9m 50s, 100 more iterations: 1h 38m 24s, 500 more iterations: 8h 12m 0s. [2026-03-25 19:55:07,940][__main__][INFO] - Starting iteration 344. [2026-03-25 19:55:07,947][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:55:07,948][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:55:19,298][__main__][INFO] - Number of regex retries in iteration 344: 0 [2026-03-25 19:55:19,299][__main__][INFO] - agents played in iteration 344 are Bob, Alice [2026-03-25 19:55:19,791][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:55:19,857][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:55:19,858][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:55:19,859][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:55:20,547][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:55:21,195][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:55:21,916][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:55:22,635][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:55:23,355][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:55:24,074][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:55:24,792][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:55:25,510][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:55:26,231][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:55:26,951][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:55:27,670][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:55:28,392][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:55:29,111][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:55:29,832][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:55:30,554][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:55:31,274][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:55:31,994][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:55:32,716][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:55:33,436][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:55:34,157][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:55:34,878][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:55:35,601][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:55:36,321][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:55:37,042][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:55:37,764][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:55:38,486][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:55:39,208][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:55:39,929][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:55:40,653][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:55:41,374][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:55:42,095][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:55:42,817][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:55:43,543][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:55:44,266][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:55:44,990][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:55:45,711][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:55:46,433][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:55:47,156][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:55:47,878][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:55:48,601][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:55:49,325][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:55:50,048][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:55:50,774][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:55:51,495][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:55:52,219][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:55:52,943][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:55:53,666][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:55:54,391][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:55:55,350][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:55:56,074][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:55:56,796][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:55:57,519][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:55:58,244][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:55:58,967][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:55:59,691][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:56:00,414][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:56:01,137][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:56:01,860][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:56:02,584][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:56:03,307][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:56:04,032][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:56:04,757][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:56:05,479][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:56:06,202][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:56:06,927][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:56:07,658][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:56:08,968][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:56:08,972][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:56:08,975][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:56:10,561][__main__][INFO] - Iteration 345 took 1m 2s (18.13% Gen, 79.33% Train). Generation: 11s, Training: 49s. Estimated remaining time: 11h 52m 0s. Estimated total time: 17h 23m 37s. Time estimates for 10 more iterations: 10m 26s, 100 more iterations: 1h 44m 21s, 500 more iterations: 8h 41m 48s. [2026-03-25 19:56:10,563][__main__][INFO] - Starting iteration 345. [2026-03-25 19:56:10,567][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:56:10,568][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:56:19,235][__main__][INFO] - Number of regex retries in iteration 345: 0 [2026-03-25 19:56:19,236][__main__][INFO] - agents played in iteration 345 are Bob, Alice [2026-03-25 19:56:19,732][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:56:19,798][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:56:19,799][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:56:19,800][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:56:20,491][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:56:21,142][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:56:21,862][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:56:22,582][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:56:23,302][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:56:24,022][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:56:24,741][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:56:25,462][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:56:26,182][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:56:26,902][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:56:27,623][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:56:28,344][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:56:29,064][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:56:29,784][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:56:30,507][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:56:31,229][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:56:31,949][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:56:32,670][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:56:33,393][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:56:34,116][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:56:34,838][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:56:35,558][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:56:36,281][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:56:37,003][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:56:37,726][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:56:38,450][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:56:39,173][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:56:39,894][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:56:40,620][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:56:41,343][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:56:42,065][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:56:42,789][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:56:43,513][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:56:44,238][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:56:44,961][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:56:45,686][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:56:46,408][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:56:47,131][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:56:47,852][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:56:48,577][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:56:49,300][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:56:50,022][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:56:50,747][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:56:51,470][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:56:52,195][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:56:52,919][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:56:53,642][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:56:54,366][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:56:55,321][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:56:56,046][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:56:56,771][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:56:57,495][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:56:58,220][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:56:58,946][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:56:59,669][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:57:00,393][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:57:01,117][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:57:01,842][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:57:02,566][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:57:03,291][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:57:04,018][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:57:04,743][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:57:05,468][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:57:06,193][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:57:06,919][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:57:07,667][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:57:08,908][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:57:08,911][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:57:08,913][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:57:10,562][__main__][INFO] - Iteration 346 took 59s (14.45% Gen, 82.80% Train). Generation: 8s, Training: 49s. Estimated remaining time: 11h 7m 20s. Estimated total time: 16h 39m 56s. Time estimates for 10 more iterations: 9m 59s, 100 more iterations: 1h 39m 59s, 500 more iterations: 8h 19m 58s. [2026-03-25 19:57:10,565][__main__][INFO] - Starting iteration 346. [2026-03-25 19:57:10,570][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:57:10,571][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:57:22,084][__main__][INFO] - Number of regex retries in iteration 346: 0 [2026-03-25 19:57:22,085][__main__][INFO] - agents played in iteration 346 are Bob, Alice [2026-03-25 19:57:22,582][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:57:22,648][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:57:22,649][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:57:22,650][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:57:23,358][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:57:24,009][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:57:24,731][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:57:25,450][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:57:26,169][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:57:26,888][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:57:27,608][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:57:28,326][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:57:29,047][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:57:29,766][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:57:30,485][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:57:31,207][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:57:31,927][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:57:32,646][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:57:33,367][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:57:34,087][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:57:34,808][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:57:35,527][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:57:36,249][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:57:36,972][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:57:37,691][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:57:38,414][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:57:39,137][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:57:39,860][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:57:40,581][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:57:41,302][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:57:42,024][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:57:42,747][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:57:43,470][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:57:44,190][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:57:44,913][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:57:45,636][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:57:46,357][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:57:47,078][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:57:47,803][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:57:48,525][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:57:49,247][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:57:49,969][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:57:50,691][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:57:51,418][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:57:52,142][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:57:52,865][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:57:53,587][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:57:54,310][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:57:55,033][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:57:55,757][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:57:56,478][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:57:57,201][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:57:58,233][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:57:58,958][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:57:59,680][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:58:00,403][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:58:01,127][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:58:01,851][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:58:02,575][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:58:03,300][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:58:04,024][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:58:04,747][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:58:05,472][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:58:06,196][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:58:06,921][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:58:07,645][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:58:08,368][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:58:09,092][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:58:09,815][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:58:10,581][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:58:11,814][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:58:18,505][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:58:18,507][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:58:19,934][__main__][INFO] - Iteration 347 took 1m 9s (16.60% Gen, 81.34% Train). Generation: 11s, Training: 56s. Estimated remaining time: 13h 42m 20s. Estimated total time: 19h 16m 6s. Time estimates for 10 more iterations: 11m 33s, 100 more iterations: 1h 55m 36s, 500 more iterations: 9h 38m 3s. [2026-03-25 19:58:19,937][__main__][INFO] - Starting iteration 347. [2026-03-25 19:58:19,941][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:58:19,942][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:58:25,267][__main__][INFO] - Number of regex retries in iteration 347: 0 [2026-03-25 19:58:25,268][__main__][INFO] - agents played in iteration 347 are Bob, Alice [2026-03-25 19:58:25,783][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:58:25,848][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:58:25,849][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:58:25,850][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:58:26,555][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:58:27,204][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:58:27,928][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:58:28,647][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:58:29,367][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:58:30,087][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:58:30,805][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:58:31,526][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:58:32,250][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:58:32,971][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:58:33,690][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:58:34,411][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:58:35,132][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:58:35,851][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:58:36,572][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:58:37,294][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:58:38,017][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:58:38,738][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:58:39,458][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:58:40,179][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:58:40,902][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:58:41,625][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:58:42,345][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:58:43,067][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:58:43,791][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:58:44,512][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:58:45,234][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:58:45,956][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:58:46,678][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:58:47,400][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:58:48,123][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:58:48,846][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:58:49,566][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:58:50,290][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:58:51,013][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:58:51,736][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:58:52,459][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:58:53,182][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:58:53,906][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:58:54,627][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:58:55,351][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:58:56,074][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:58:56,796][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:58:57,519][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:58:58,244][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:58:58,968][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:58:59,691][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:59:00,413][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:59:01,368][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:59:02,092][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 19:59:02,815][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 19:59:03,538][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
19:59:04,264][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 19:59:04,989][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 19:59:05,710][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 19:59:06,434][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 19:59:07,158][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 19:59:07,882][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 19:59:08,604][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 19:59:09,330][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 19:59:10,054][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 19:59:10,778][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 19:59:11,505][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 19:59:12,230][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 19:59:12,955][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 19:59:13,696][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 19:59:14,772][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 19:59:14,776][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 19:59:14,777][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 19:59:16,583][__main__][INFO] - Iteration 348 took 56s (9.40% Gen, 87.40% Train). Generation: 5s, Training: 49s. Estimated remaining time: 10h 9m 21s. Estimated total time: 15h 44m 4s. Time estimates for 10 more iterations: 9m 26s, 100 more iterations: 1h 34m 24s, 500 more iterations: 7h 52m 2s. [2026-03-25 19:59:16,586][__main__][INFO] - Starting iteration 348. [2026-03-25 19:59:16,590][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 19:59:16,590][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 19:59:22,407][__main__][INFO] - Number of regex retries in iteration 348: 0 [2026-03-25 19:59:22,408][__main__][INFO] - agents played in iteration 348 are Bob, Alice [2026-03-25 19:59:23,007][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:59:23,074][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 19:59:23,075][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 19:59:23,076][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 19:59:23,769][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 19:59:24,418][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 19:59:25,142][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 19:59:25,863][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 19:59:26,583][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 19:59:27,305][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 19:59:28,027][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 19:59:28,749][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 19:59:29,471][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 19:59:30,192][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 19:59:30,915][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 19:59:31,639][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 19:59:32,362][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 19:59:33,085][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 19:59:33,807][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 19:59:34,528][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 19:59:35,252][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 19:59:35,976][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 19:59:36,699][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 19:59:37,422][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 19:59:38,144][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 19:59:38,868][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 19:59:39,591][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 19:59:40,315][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 19:59:41,039][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 19:59:41,764][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 19:59:42,486][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 19:59:43,209][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 19:59:43,933][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 19:59:44,656][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 19:59:45,379][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 19:59:46,103][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 19:59:46,826][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 19:59:47,760][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 19:59:48,485][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 19:59:49,211][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 19:59:49,933][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 19:59:50,656][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 19:59:51,381][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 19:59:52,104][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 19:59:52,828][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 19:59:53,554][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 19:59:54,279][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 19:59:55,004][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 19:59:55,728][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 19:59:56,452][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 19:59:57,176][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 19:59:57,899][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 19:59:58,854][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 19:59:59,578][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:00:00,303][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:00:01,025][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:00:01,750][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:00:02,475][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:00:03,200][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:00:03,925][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:00:04,649][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:00:05,373][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:00:06,098][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:00:06,822][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:00:07,547][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:00:08,274][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:00:08,999][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:00:09,726][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:00:10,452][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:00:11,198][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 20:00:12,233][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:00:12,237][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:00:12,238][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:00:16,383][__main__][INFO] - Iteration 349 took 59s (9.73% Gen, 83.33% Train). Generation: 5s, Training: 49s. Estimated remaining time: 11h 0m 53s. Estimated total time: 16h 36m 35s. Time estimates for 10 more iterations: 9m 57s, 100 more iterations: 1h 39m 39s, 500 more iterations: 8h 18m 17s. [2026-03-25 20:00:16,386][__main__][INFO] - Starting iteration 349. [2026-03-25 20:00:16,390][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 20:00:16,391][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:00:37,056][__main__][INFO] - Number of regex retries in iteration 349: 0 [2026-03-25 20:00:37,057][__main__][INFO] - agents played in iteration 349 are Bob, Alice [2026-03-25 20:00:37,602][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:00:37,668][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:00:37,668][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:00:37,669][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:00:38,363][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:00:39,010][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:00:39,730][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:00:40,445][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:00:41,164][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:00:41,881][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:00:42,598][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:00:43,315][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:00:44,032][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:00:44,748][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:00:45,466][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:00:46,181][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:00:46,899][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:00:47,616][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:00:48,332][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:00:49,050][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:00:49,768][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:00:50,485][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:00:51,202][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:00:51,922][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:00:52,639][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:00:53,358][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:00:54,077][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:00:54,795][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:00:55,514][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:00:56,234][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:00:56,950][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:00:57,670][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:00:58,390][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:00:59,108][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:00:59,828][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:01:00,546][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:01:01,266][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:01:01,987][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:01:02,707][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:01:03,427][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:01:04,147][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:01:04,870][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:01:05,590][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:01:06,311][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:01:07,031][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:01:07,752][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:01:08,472][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:01:09,195][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:01:09,917][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:01:10,636][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:01:11,356][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:01:12,078][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:01:13,115][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:01:13,839][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:01:14,559][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:01:15,279][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:01:16,002][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:01:16,723][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:01:17,444][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:01:18,165][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:01:18,889][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:01:19,610][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:01:20,330][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:01:21,053][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:01:21,776][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:01:22,497][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:01:23,218][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:01:23,942][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:01:24,665][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:01:25,411][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 20:01:26,772][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:01:26,777][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:01:26,780][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:01:28,167][__main__][INFO] - Iteration 350 took 1m 11s (28.79% Gen, 69.27% Train). Generation: 20s, Training: 49s. Estimated remaining time: 14h 19m 25s. Estimated total time: 19h 56m 19s. Time estimates for 10 more iterations: 11m 57s, 100 more iterations: 1h 59m 37s, 500 more iterations: 9h 58m 9s. [2026-03-25 20:01:28,170][__main__][INFO] - Starting iteration 350. [2026-03-25 20:01:28,175][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2026-03-25 20:01:28,176][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:01:41,555][__main__][INFO] - Number of regex retries in iteration 350: 0 [2026-03-25 20:01:41,556][__main__][INFO] - agents played in iteration 350 are Bob, Alice [2026-03-25 20:01:42,051][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:01:42,116][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:01:42,117][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:01:42,118][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:01:42,804][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:01:43,450][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:01:44,170][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:01:44,886][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:01:45,604][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:01:46,320][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:01:47,039][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:01:47,756][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:01:48,474][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:01:49,191][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:01:49,908][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:01:50,627][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:01:51,345][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:01:52,064][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:01:52,782][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:01:53,500][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:01:54,220][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:01:54,938][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:01:55,658][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:01:56,377][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:01:57,095][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:01:57,815][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:01:58,535][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:01:59,253][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:01:59,973][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:02:00,699][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:02:01,418][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:02:02,138][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:02:02,858][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:02:03,578][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:02:04,298][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:02:05,019][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:02:05,740][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:02:06,459][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:02:07,180][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:02:07,903][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:02:08,627][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:02:09,348][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:02:10,067][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:02:10,789][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:02:11,511][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:02:12,231][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:02:12,953][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:02:14,786][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:02:15,510][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:02:16,235][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:02:16,955][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:02:17,675][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:02:18,634][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:02:19,354][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:02:20,075][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:02:20,798][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:02:21,519][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:02:22,240][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:02:22,962][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:02:23,684][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:02:24,407][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:02:25,129][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:02:25,849][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:02:26,571][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:02:27,294][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:02:28,015][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:02:28,737][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:02:29,462][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:02:30,182][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:02:30,916][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:48 [2026-03-25 20:02:31,999][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:02:32,002][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:02:32,004][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:02:34,846][__main__][INFO] - Iteration 351 took 1m 6s (20.07% Gen, 75.66% Train). Generation: 13s, Training: 50s. Estimated remaining time: 12h 53m 13s. Estimated total time: 18h 31m 13s. Time estimates for 10 more iterations: 11m 6s, 100 more iterations: 1h 51m 7s, 500 more iterations: 9h 15m 36s. [2026-03-25 20:02:34,852][__main__][INFO] - Starting iteration 351. [2026-03-25 20:02:34,859][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:02:34,860][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:02:43,137][__main__][INFO] - Number of regex retries in iteration 351: 0 [2026-03-25 20:02:43,139][__main__][INFO] - agents played in iteration 351 are Bob, Alice [2026-03-25 20:02:43,638][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:02:43,703][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:02:43,703][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:02:43,704][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:02:44,399][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:02:45,046][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:02:45,767][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:02:46,490][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:02:47,210][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:02:47,927][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:02:48,645][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:02:49,364][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:02:50,080][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:02:50,801][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:02:51,519][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:02:52,237][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:02:52,956][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:02:53,675][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:02:54,393][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:02:55,112][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:02:55,831][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:02:56,549][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:02:57,268][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:02:57,986][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:02:58,707][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:02:59,424][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:03:00,142][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:03:00,863][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:03:01,582][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:03:02,301][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:03:03,021][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:03:03,740][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:03:04,460][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:03:05,180][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:03:05,900][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:03:06,620][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:03:07,341][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:03:08,060][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:03:08,779][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:03:09,501][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:03:10,225][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:03:10,945][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:03:11,668][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:03:12,389][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:03:13,112][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:03:13,832][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:03:14,553][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:03:15,275][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:03:15,998][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:03:16,719][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:03:17,440][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:03:18,161][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:03:19,113][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:03:19,836][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:03:20,558][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:03:21,280][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:03:22,002][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:03:22,723][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:03:23,446][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:03:24,168][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:03:24,889][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:03:25,612][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:03:26,334][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:03:27,055][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:03:27,775][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:03:28,498][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:03:29,220][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:03:29,941][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:03:30,663][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:03:31,422][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 20:03:32,701][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:03:32,705][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:03:32,707][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:03:34,109][__main__][INFO] - Iteration 352 took 59s (13.97% Gen, 83.65% Train). Generation: 8s, Training: 49s. Estimated remaining time: 10h 48m 34s. Estimated total time: 16h 27m 33s. Time estimates for 10 more iterations: 9m 52s, 100 more iterations: 1h 38m 45s, 500 more iterations: 8h 13m 46s. [2026-03-25 20:03:34,112][__main__][INFO] - Starting iteration 352. [2026-03-25 20:03:34,116][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:03:34,117][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:03:39,504][__main__][INFO] - Number of regex retries in iteration 352: 0 [2026-03-25 20:03:39,505][__main__][INFO] - agents played in iteration 352 are Bob, Alice [2026-03-25 20:03:40,005][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:03:40,072][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:03:40,072][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:03:40,073][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:03:40,796][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:03:41,447][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:03:42,172][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:03:42,893][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:03:43,615][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:03:44,337][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:03:45,057][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:03:45,777][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:03:46,499][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:03:47,220][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:03:47,941][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:03:48,662][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:03:49,382][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:03:50,101][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:03:50,821][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:03:51,543][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:03:52,267][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:03:52,988][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:03:53,710][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:03:54,431][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:03:55,153][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:03:55,873][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:03:56,594][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:03:57,316][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:03:58,041][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:03:58,764][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:03:59,490][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:04:00,214][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:04:00,936][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:04:01,657][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:04:02,383][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:04:03,109][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:04:03,832][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:04:04,558][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:04:05,286][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:04:06,010][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:04:06,738][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:04:07,463][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:04:08,187][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:04:08,911][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:04:09,635][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:04:10,358][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:04:11,081][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:04:11,805][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:04:12,530][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:04:13,254][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:04:13,980][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:04:14,706][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:04:15,778][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:04:16,506][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:04:17,231][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:04:17,957][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:04:18,682][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:04:19,406][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:04:20,131][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:04:20,856][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:04:21,580][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:04:22,304][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:04:23,027][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:04:23,754][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:04:24,481][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:04:25,207][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:04:25,934][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:04:26,660][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:04:27,385][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:04:28,259][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 20:04:29,438][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:04:29,442][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:04:29,444][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:04:30,830][__main__][INFO] - Iteration 353 took 56s (9.50% Gen, 88.05% Train). Generation: 5s, Training: 49s. Estimated remaining time: 10h 5m 19s. Estimated total time: 15h 45m 16s. Time estimates for 10 more iterations: 9m 27s, 100 more iterations: 1h 34m 31s, 500 more iterations: 7h 52m 38s. [2026-03-25 20:04:30,833][__main__][INFO] - Starting iteration 353. [2026-03-25 20:04:30,840][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:04:30,841][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:04:40,530][__main__][INFO] - Number of regex retries in iteration 353: 0 [2026-03-25 20:04:40,531][__main__][INFO] - agents played in iteration 353 are Bob, Alice [2026-03-25 20:04:41,226][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:04:41,298][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:04:41,299][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:04:41,300][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:04:42,158][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:04:42,809][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:04:43,534][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:04:44,256][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:04:44,978][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:04:45,699][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:04:46,421][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:04:47,142][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:04:47,865][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:04:48,586][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:04:49,308][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:04:50,028][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:04:50,748][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:04:51,470][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:04:52,189][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:04:52,908][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:04:53,629][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:04:54,349][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:04:55,069][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:04:55,787][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:04:56,506][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:04:57,227][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:04:57,950][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:04:58,672][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:04:59,393][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:05:00,114][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:05:00,838][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:05:01,560][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:05:02,282][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:05:03,002][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:05:03,723][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:05:04,444][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:05:05,166][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:05:05,885][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:05:06,605][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:05:07,325][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:05:08,046][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:05:08,767][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:05:09,489][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:05:10,209][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:05:10,930][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:05:11,652][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:05:12,376][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:05:13,098][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:05:13,821][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:05:14,543][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:05:15,265][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:05:15,989][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:05:16,966][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:05:17,689][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:05:18,411][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:05:19,135][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:05:19,859][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:05:20,583][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:05:21,305][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:05:22,028][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:05:22,751][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:05:23,474][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:05:24,198][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:05:24,922][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:05:25,643][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:05:26,366][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:05:27,090][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:05:27,813][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:05:28,535][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:05:29,281][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 20:05:30,401][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:05:30,404][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:05:30,405][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:05:31,782][__main__][INFO] - Iteration 354 took 1m 0s (15.90% Gen, 81.84% Train). Generation: 9s, Training: 49s. Estimated remaining time: 11h 14m 45s. Estimated total time: 16h 55m 43s. Time estimates for 10 more iterations: 10m 9s, 100 more iterations: 1h 41m 34s, 500 more iterations: 8h 27m 51s. [2026-03-25 20:05:31,784][__main__][INFO] - Starting iteration 354. [2026-03-25 20:05:31,788][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:05:31,789][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:05:37,361][__main__][INFO] - Number of regex retries in iteration 354: 0 [2026-03-25 20:05:37,362][__main__][INFO] - agents played in iteration 354 are Bob, Alice [2026-03-25 20:05:38,076][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:05:38,146][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:05:38,147][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:05:38,147][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:05:38,916][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:05:39,567][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:05:40,291][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:05:41,010][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:05:41,727][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:05:42,446][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:05:43,166][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:05:43,883][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:05:44,603][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:05:45,323][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:05:46,043][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:05:46,763][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:05:47,481][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:05:48,199][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:05:48,916][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:05:49,639][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:05:50,362][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:05:51,084][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:05:51,807][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:05:52,528][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:05:53,249][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:05:53,971][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:05:54,689][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:05:55,407][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:05:56,127][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:05:56,845][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:05:57,565][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:05:58,287][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:05:59,006][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:05:59,726][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:06:00,446][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:06:01,165][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:06:01,885][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:06:02,603][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:06:03,320][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:06:04,038][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:06:04,758][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:06:05,476][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:06:06,195][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:06:06,913][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:06:07,630][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:06:08,350][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:06:09,068][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:06:10,121][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:06:15,535][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:06:16,252][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:06:16,969][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:06:17,684][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:06:18,633][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:06:19,351][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:06:20,068][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:06:20,785][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:06:21,502][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:06:22,219][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:06:22,938][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:06:23,655][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:06:24,374][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:06:25,091][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:06:25,810][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:06:26,526][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:06:27,245][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:06:27,962][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:06:28,680][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:06:29,398][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:06:30,115][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:06:30,840][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:51 [2026-03-25 20:06:32,207][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:06:32,212][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:06:32,214][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:06:33,987][__main__][INFO] - Iteration 355 took 1m 2s (8.96% Gen, 88.19% Train). Generation: 5s, Training: 54s. Estimated remaining time: 11h 34m 41s. Estimated total time: 17h 16m 40s. Time estimates for 10 more iterations: 10m 22s, 100 more iterations: 1h 43m 40s, 500 more iterations: 8h 38m 20s. [2026-03-25 20:06:34,032][__main__][INFO] - Starting iteration 355. [2026-03-25 20:06:34,039][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:06:34,040][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:06:48,231][__main__][INFO] - Number of regex retries in iteration 355: 0 [2026-03-25 20:06:48,233][__main__][INFO] - agents played in iteration 355 are Bob, Alice [2026-03-25 20:06:48,789][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:06:48,854][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:06:48,855][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:06:48,856][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:06:49,586][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:06:50,228][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:06:50,942][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:06:51,654][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:06:52,367][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:06:53,080][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:06:53,793][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:06:54,509][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:06:55,223][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:06:55,935][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:06:56,650][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:06:57,363][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:06:58,078][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:06:58,792][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:06:59,509][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:07:00,226][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:07:00,943][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:07:01,657][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:07:02,373][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:07:03,087][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:07:03,801][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:07:04,518][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:07:05,233][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:07:05,948][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:07:06,663][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:07:07,379][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:07:08,093][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:07:08,809][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:07:09,525][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:07:10,241][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:07:10,956][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:07:11,669][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:07:12,385][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:07:13,100][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:07:13,815][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:07:14,529][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:07:15,245][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:07:15,959][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:07:16,674][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:07:17,390][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:07:18,105][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:07:18,822][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:07:19,535][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:07:20,254][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:07:20,968][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:07:21,685][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:07:22,401][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:07:23,117][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:07:24,151][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:07:24,868][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:07:25,581][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:07:26,298][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:07:27,014][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:07:27,730][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:07:28,446][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:07:29,162][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:07:29,878][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:07:30,595][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:07:31,311][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:07:32,027][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:07:32,742][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:07:33,460][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:07:34,176][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:07:34,894][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:07:35,610][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:07:36,340][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 20:07:37,624][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:07:37,628][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:07:37,630][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:07:39,151][__main__][INFO] - Iteration 356 took 1m 5s (21.80% Gen, 75.86% Train). Generation: 14s, Training: 49s. Estimated remaining time: 12h 22m 9s. Estimated total time: 18h 5m 13s. Time estimates for 10 more iterations: 10m 51s, 100 more iterations: 1h 48m 31s, 500 more iterations: 9h 2m 36s. [2026-03-25 20:07:39,154][__main__][INFO] - Starting iteration 356. [2026-03-25 20:07:39,157][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:07:39,158][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:07:44,414][__main__][INFO] - Number of regex retries in iteration 356: 0 [2026-03-25 20:07:44,415][__main__][INFO] - agents played in iteration 356 are Bob, Alice [2026-03-25 20:07:44,905][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:07:44,970][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:07:44,971][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:07:44,972][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:07:45,661][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:07:46,307][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:07:47,024][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:07:47,740][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:07:48,454][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:07:49,169][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:07:49,886][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:07:50,601][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:07:51,317][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:07:52,032][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:07:52,747][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:07:53,463][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:07:54,181][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:07:54,899][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:07:55,614][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:07:56,330][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:07:57,046][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:07:57,763][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:07:58,480][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:07:59,198][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:07:59,915][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:08:00,630][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:08:01,346][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:08:02,063][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:08:02,778][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:08:03,498][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:08:04,215][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:08:04,933][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:08:05,649][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:08:06,366][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:08:07,082][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:08:07,799][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:08:08,518][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:08:09,236][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:08:09,952][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:08:10,670][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:08:11,386][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:08:12,104][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:08:12,823][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:08:13,543][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:08:14,260][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:08:14,978][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:08:15,695][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:08:16,411][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:08:17,128][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:08:17,846][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:08:18,563][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:08:19,279][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:08:20,230][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:08:20,948][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:08:21,665][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:08:22,383][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:08:23,100][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:08:23,817][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:08:24,536][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:08:25,252][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:08:25,969][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:08:26,686][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:08:27,404][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:08:28,122][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:08:28,840][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:08:29,557][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:08:30,275][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:08:30,994][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:08:31,712][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:08:32,440][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 20:08:33,422][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:08:33,424][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:08:33,426][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:08:34,824][__main__][INFO] - Iteration 357 took 55s (9.44% Gen, 88.04% Train). Generation: 5s, Training: 49s. Estimated remaining time: 9h 43m 47s. Estimated total time: 15h 27m 47s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 46s, 500 more iterations: 7h 43m 53s. [2026-03-25 20:08:34,826][__main__][INFO] - Starting iteration 357. [2026-03-25 20:08:34,830][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:08:34,831][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:08:40,080][__main__][INFO] - Number of regex retries in iteration 357: 0 [2026-03-25 20:08:40,081][__main__][INFO] - agents played in iteration 357 are Bob, Alice [2026-03-25 20:08:40,569][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:08:40,632][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:08:40,633][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:08:40,634][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:08:41,320][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:08:41,966][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:08:42,683][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:08:43,397][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:08:44,114][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:08:44,828][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:08:45,545][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:08:46,260][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:08:46,978][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:08:47,694][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:08:48,410][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:08:49,125][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:08:49,842][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:08:50,556][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:08:51,274][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:08:51,989][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:08:52,706][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:08:53,422][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:08:54,138][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:08:54,854][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:08:55,570][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:08:56,286][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:08:57,002][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:08:57,719][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:08:58,436][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:08:59,152][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:08:59,869][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:09:00,586][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:09:01,302][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:09:02,019][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:09:02,736][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:09:03,452][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:09:04,168][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:09:04,886][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:09:05,603][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:09:06,320][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:09:07,037][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:09:07,755][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:09:08,472][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:09:09,190][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:09:09,908][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:09:10,624][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:09:11,341][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:09:12,059][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:09:12,775][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:09:13,494][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:09:14,210][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:09:14,927][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:09:15,876][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:09:16,596][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:09:17,313][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:09:18,030][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:09:18,748][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:09:19,466][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:09:20,184][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:09:20,902][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:09:21,621][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:09:22,339][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:09:23,056][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:09:23,774][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:09:24,492][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:09:25,211][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:09:25,929][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:09:26,647][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:09:27,365][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:09:28,097][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 20:09:29,249][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:09:29,253][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:09:29,255][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:09:30,615][__main__][INFO] - Iteration 358 took 55s (9.41% Gen, 88.15% Train). Generation: 5s, Training: 49s. Estimated remaining time: 9h 44m 50s. Estimated total time: 15h 29m 47s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 58s, 500 more iterations: 7h 44m 53s. [2026-03-25 20:09:30,618][__main__][INFO] - Starting iteration 358. [2026-03-25 20:09:30,623][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:09:30,623][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:09:35,871][__main__][INFO] - Number of regex retries in iteration 358: 0 [2026-03-25 20:09:35,872][__main__][INFO] - agents played in iteration 358 are Bob, Alice [2026-03-25 20:09:36,360][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:09:36,424][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:09:36,425][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:09:36,426][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:09:37,112][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:09:37,759][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:09:38,475][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:09:39,193][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:09:39,907][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:09:40,624][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:09:41,339][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:09:42,056][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:09:42,772][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:09:43,488][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:09:44,204][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:09:44,923][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:09:45,638][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:09:46,356][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:09:47,072][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:09:47,790][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:09:48,504][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:09:49,220][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:09:49,936][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:09:50,653][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:09:51,369][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:09:52,086][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:09:52,804][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:09:53,520][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:09:54,238][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:09:54,956][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:09:55,674][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:09:56,390][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:09:57,109][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:09:57,825][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:09:58,542][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:09:59,258][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:09:59,973][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:10:00,692][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:10:01,409][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:10:02,128][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:10:02,845][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:10:03,576][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:10:04,302][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:10:05,018][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:10:05,738][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:10:06,457][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:10:07,174][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:10:07,891][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:10:08,611][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:10:09,328][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:10:10,046][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:10:10,765][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:10:11,737][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:10:12,456][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:10:13,173][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:10:13,891][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:10:14,608][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:10:15,328][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:10:16,045][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:10:16,764][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:10:17,481][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:10:18,201][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:10:18,919][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:10:19,638][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:10:20,356][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:10:21,075][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:10:21,798][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:10:22,517][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:10:23,238][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:10:24,026][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 20:10:25,243][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:10:25,247][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:10:25,249][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:10:26,656][__main__][INFO] - Iteration 359 took 56s (9.37% Gen, 88.12% Train). Generation: 5s, Training: 49s. Estimated remaining time: 9h 48m 3s. Estimated total time: 15h 33m 56s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 23s, 500 more iterations: 7h 46m 58s. [2026-03-25 20:10:26,659][__main__][INFO] - Starting iteration 359. [2026-03-25 20:10:26,667][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:10:26,667][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:10:32,316][__main__][INFO] - Number of regex retries in iteration 359: 0 [2026-03-25 20:10:32,317][__main__][INFO] - agents played in iteration 359 are Bob, Alice [2026-03-25 20:10:32,809][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:10:32,872][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:10:32,873][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:10:32,874][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:10:33,560][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:10:34,207][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:10:34,927][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:10:35,643][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:10:36,358][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:10:37,073][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:10:37,788][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:10:38,504][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:10:39,220][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:10:39,936][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:10:40,653][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:10:41,369][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:10:42,085][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:10:42,801][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:10:43,517][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:10:44,234][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:10:44,949][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:10:45,665][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:10:46,381][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:10:47,095][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:10:47,812][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:10:48,528][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:10:49,245][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:10:49,962][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:10:50,680][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:10:51,396][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:10:52,113][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:10:52,829][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:10:53,546][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:10:54,263][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:10:54,980][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:10:55,697][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:10:56,414][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:10:57,132][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:10:57,848][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:10:58,565][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:10:59,283][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:11:00,000][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:11:00,717][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:11:01,434][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:11:02,151][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:11:02,868][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:11:03,586][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:11:04,303][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:11:05,021][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:11:05,739][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:11:06,456][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:11:07,174][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:11:08,152][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:11:08,871][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:11:09,588][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:11:10,309][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:11:11,025][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:11:11,745][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:11:12,461][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:11:13,178][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:11:13,896][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:11:14,614][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:11:15,333][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:11:16,051][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:11:16,770][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:11:17,488][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:11:18,206][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:11:18,925][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:11:19,642][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:11:20,361][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 20:11:21,477][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:11:21,482][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:11:21,483][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:11:22,911][__main__][INFO] - Iteration 360 took 56s (10.04% Gen, 87.41% Train). Generation: 5s, Training: 49s. Estimated remaining time: 9h 50m 37s. Estimated total time: 15h 37m 26s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 44s, 500 more iterations: 7h 48m 43s. [2026-03-25 20:11:22,913][__main__][INFO] - Starting iteration 360. [2026-03-25 20:11:22,917][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:11:22,918][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:11:28,255][__main__][INFO] - Number of regex retries in iteration 360: 0 [2026-03-25 20:11:28,256][__main__][INFO] - agents played in iteration 360 are Bob, Alice [2026-03-25 20:11:28,744][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:11:28,808][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:11:28,809][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:11:28,810][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:11:29,521][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:11:30,168][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:11:30,887][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:11:31,602][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:11:32,318][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:11:33,034][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:11:33,751][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:11:34,467][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:11:35,184][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:11:35,899][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:11:36,616][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:11:37,331][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:11:38,048][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:11:38,763][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:11:39,480][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:11:40,197][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:11:40,913][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:11:41,629][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:11:42,346][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:11:43,063][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:11:43,779][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:11:44,495][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:11:45,212][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:11:45,927][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:11:46,644][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:11:47,361][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:11:48,077][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:11:48,794][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:11:49,511][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:11:50,230][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:11:50,945][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:11:51,664][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:11:52,381][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:11:53,098][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:11:53,815][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:11:54,533][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:11:55,251][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:11:55,968][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:11:56,685][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:11:57,402][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:11:58,120][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:11:58,836][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:11:59,553][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:12:00,272][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:12:00,988][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:12:01,707][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:12:02,424][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:12:03,141][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:12:04,085][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:12:04,803][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:12:05,520][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:12:06,238][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:12:06,955][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:12:07,674][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:12:08,390][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:12:09,109][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:12:09,827][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:12:10,544][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:12:11,264][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:12:11,982][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:12:12,700][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:12:13,418][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:12:14,134][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:12:14,854][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:12:15,572][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:12:16,294][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46
[2026-03-25 20:12:17,652][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt
[2026-03-25 20:12:17,657][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt
[2026-03-25 20:12:17,659][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl
[2026-03-25 20:12:19,114][__main__][INFO] - Iteration 361 took 56s (9.50% Gen, 87.91% Train). Generation: 5s, Training: 49s. Estimated remaining time: 9h 48m 54s. Estimated total time: 15h 36m 39s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 39s, 500 more iterations: 7h 48m 19s.
[2026-03-25 20:12:19,119][__main__][INFO] - Starting iteration 361.
[2026-03-25 20:12:19,126][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1.
[2026-03-25 20:12:19,127][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:12:25,151][__main__][INFO] - Number of regex retries in iteration 361: 0 [2026-03-25 20:12:25,152][__main__][INFO] - agents played in iteration 361 are Bob, Alice [2026-03-25 20:12:25,677][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:12:25,742][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:12:25,743][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:12:25,743][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:12:26,429][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:12:27,075][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:12:27,796][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:12:28,511][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:12:29,226][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:12:29,940][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:12:30,654][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:12:31,371][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:12:32,087][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:12:32,805][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:12:33,520][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:12:34,236][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:12:34,951][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:12:35,667][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:12:36,384][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:12:37,100][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:12:37,815][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:12:38,532][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:12:39,248][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:12:39,966][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:12:40,683][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:12:41,399][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:12:42,114][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:12:42,831][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:12:43,547][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:12:44,263][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:12:44,980][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:12:45,697][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:12:46,413][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:12:47,130][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:12:47,845][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:12:48,562][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:12:49,279][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:12:49,994][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:12:50,712][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:12:51,428][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:12:52,145][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:12:52,863][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:12:53,581][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:12:54,300][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:12:55,018][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:12:55,738][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:12:56,455][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:12:57,175][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:12:57,894][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:12:58,612][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:12:59,331][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:13:00,050][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:13:01,027][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:13:01,747][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:13:02,465][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:13:03,182][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:13:03,902][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:13:04,620][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:13:05,340][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:13:06,060][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:13:06,779][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:13:07,498][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:13:08,217][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:13:08,936][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:13:09,655][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:13:10,375][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:13:11,094][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:13:11,813][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:13:12,533][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:13:13,322][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46
[2026-03-25 20:13:14,374][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt
[2026-03-25 20:13:14,378][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt
[2026-03-25 20:13:14,380][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl
[2026-03-25 20:13:15,721][__main__][INFO] - Iteration 362 took 56s (10.64% Gen, 86.98% Train). Generation: 6s, Training: 49s. Estimated remaining time: 9h 54m 37s. Estimated total time: 15h 43m 18s. Time estimates for 10 more iterations: 9m 25s, 100 more iterations: 1h 34m 19s, 500 more iterations: 7h 51m 39s.
[2026-03-25 20:13:15,723][__main__][INFO] - Starting iteration 362.
[2026-03-25 20:13:15,728][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1.
[2026-03-25 20:13:15,729][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:13:20,939][__main__][INFO] - Number of regex retries in iteration 362: 0 [2026-03-25 20:13:20,940][__main__][INFO] - agents played in iteration 362 are Bob, Alice [2026-03-25 20:13:21,525][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:13:21,590][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:13:21,591][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:13:21,592][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:13:22,286][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:13:22,936][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:13:23,660][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:13:24,376][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:13:25,091][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:13:25,807][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:13:26,523][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:13:27,238][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:13:27,956][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:13:28,671][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:13:29,389][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:13:30,105][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:13:30,822][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:13:31,539][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:13:32,257][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:13:32,975][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:13:33,694][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:13:34,411][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:13:35,129][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:13:35,848][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:13:36,565][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:13:37,283][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:13:38,001][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:13:38,720][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:13:39,439][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:13:40,157][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:13:40,877][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:13:41,594][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:13:42,313][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:13:43,032][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:13:43,751][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:13:44,471][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:13:45,188][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:13:45,909][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:13:46,628][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:13:47,347][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:13:48,066][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:13:48,785][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:13:49,505][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:13:50,224][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:13:50,943][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:13:51,663][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:13:52,382][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:13:53,100][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:13:53,819][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:13:54,536][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:13:55,256][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:13:55,972][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:13:56,957][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:13:57,677][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:13:58,397][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:13:59,114][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:13:59,834][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:14:00,551][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:14:01,269][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:14:01,987][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:14:02,705][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:14:03,423][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:14:04,143][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:14:04,861][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:14:05,582][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:14:06,299][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:14:07,019][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:14:07,737][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:14:08,456][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:14:09,184][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46
[2026-03-25 20:14:10,319][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt
[2026-03-25 20:14:10,322][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt
[2026-03-25 20:14:10,324][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl
[2026-03-25 20:14:11,911][__main__][INFO] - Iteration 363 took 56s (9.28% Gen, 87.89% Train). Generation: 5s, Training: 49s. Estimated remaining time: 9h 46m 48s. Estimated total time: 15h 36m 26s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 38s, 500 more iterations: 7h 48m 13s.
[2026-03-25 20:14:11,914][__main__][INFO] - Starting iteration 363.
[2026-03-25 20:14:11,924][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1.
[2026-03-25 20:14:11,925][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:14:17,072][__main__][INFO] - Number of regex retries in iteration 363: 0 [2026-03-25 20:14:17,073][__main__][INFO] - agents played in iteration 363 are Bob, Alice [2026-03-25 20:14:17,593][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:14:17,658][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:14:17,659][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:14:17,659][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:14:18,401][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:14:19,049][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:14:19,769][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:14:20,486][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:14:21,202][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:14:21,920][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:14:22,637][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:14:23,354][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:14:24,071][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:14:24,787][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:14:25,504][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:14:26,221][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:14:26,938][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:14:27,655][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:14:28,374][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:14:29,090][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:14:29,808][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:14:30,524][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:14:31,241][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:14:31,959][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:14:32,676][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:14:33,396][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:14:34,113][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:14:34,831][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:14:35,549][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:14:36,270][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:14:36,988][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:14:37,706][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:14:38,426][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:14:39,145][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:14:39,865][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:14:40,584][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:14:41,302][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:14:42,021][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:14:42,737][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:14:43,455][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:14:44,171][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:14:44,890][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:14:45,606][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:14:46,323][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:14:47,040][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:14:47,758][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:14:48,475][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:14:49,194][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:14:49,911][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:14:50,628][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:14:51,346][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:14:52,065][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:14:53,017][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:14:53,735][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:14:54,451][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:14:55,168][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:14:55,886][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:14:56,602][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:14:57,323][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:14:58,039][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:14:58,756][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:14:59,475][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:15:00,192][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:15:00,911][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:15:01,630][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:15:02,346][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:15:03,065][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:15:03,782][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:15:04,501][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:15:05,234][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 20:15:07,340][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:15:07,345][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:15:07,346][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:15:08,733][__main__][INFO] - Iteration 364 took 56s (9.06% Gen, 88.48% Train). Generation: 5s, Training: 50s. Estimated remaining time: 9h 56m 22s. Estimated total time: 15h 46m 56s. Time estimates for 10 more iterations: 9m 28s, 100 more iterations: 1h 34m 41s, 500 more iterations: 7h 53m 28s. [2026-03-25 20:15:08,736][__main__][INFO] - Starting iteration 364. [2026-03-25 20:15:08,739][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:15:08,740][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:15:14,494][__main__][INFO] - Number of regex retries in iteration 364: 0 [2026-03-25 20:15:14,495][__main__][INFO] - agents played in iteration 364 are Bob, Alice [2026-03-25 20:15:14,986][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:15:15,051][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:15:15,052][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:15:15,053][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:15:15,739][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:15:16,385][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:15:17,103][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:15:17,818][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:15:18,534][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:15:19,250][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:15:19,966][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:15:20,682][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:15:21,398][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:15:22,112][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:15:22,827][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:15:23,543][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:15:24,261][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:15:24,978][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:15:25,693][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:15:26,408][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:15:27,125][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:15:27,840][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:15:28,558][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:15:29,274][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:15:29,990][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:15:30,705][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:15:31,423][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:15:32,139][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:15:32,855][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:15:33,570][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:15:34,288][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:15:35,004][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:15:35,721][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:15:36,438][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:15:37,155][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:15:37,871][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:15:38,588][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:15:39,306][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:15:40,022][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:15:40,741][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:15:41,457][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:15:42,174][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:15:42,891][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:15:43,609][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:15:44,324][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:15:45,044][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:15:45,759][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:15:46,477][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:15:47,196][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:15:47,912][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:15:48,630][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:15:49,346][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:15:50,312][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:15:51,030][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:15:51,748][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:15:52,467][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:15:53,184][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:15:53,902][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:15:54,619][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:15:55,339][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:15:56,055][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:15:56,774][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:15:57,491][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:15:58,208][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:15:58,925][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:15:59,642][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:16:00,360][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:16:01,078][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:16:01,796][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:16:02,580][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 20:16:03,722][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:16:03,726][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:16:03,728][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:16:05,045][__main__][INFO] - Iteration 365 took 56s (10.22% Gen, 87.44% Train). Generation: 5s, Training: 49s. Estimated remaining time: 9h 46m 55s. Estimated total time: 15h 38m 26s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 50s, 500 more iterations: 7h 49m 13s. [2026-03-25 20:16:05,047][__main__][INFO] - Starting iteration 365. [2026-03-25 20:16:05,051][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:16:05,052][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:16:10,130][__main__][INFO] - Number of regex retries in iteration 365: 0 [2026-03-25 20:16:10,132][__main__][INFO] - agents played in iteration 365 are Bob, Alice [2026-03-25 20:16:10,623][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:16:10,686][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:16:10,687][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:16:10,688][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:16:11,376][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:16:12,022][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:16:12,739][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:16:13,458][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:16:14,172][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:16:14,888][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:16:15,603][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:16:16,319][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:16:17,033][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:16:17,751][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:16:18,467][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:16:19,183][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:16:19,897][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:16:20,616][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:16:21,331][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:16:22,049][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:16:22,765][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:16:23,481][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:16:24,197][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:16:24,913][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:16:25,629][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:16:26,346][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:16:27,061][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:16:27,778][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:16:28,494][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:16:29,210][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:16:29,926][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:16:30,643][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:16:31,360][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:16:32,076][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:16:32,792][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:16:33,510][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:16:34,225][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:16:34,941][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:16:35,659][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:16:36,375][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:16:37,094][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:16:37,810][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:16:38,529][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:16:39,247][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:16:39,966][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:16:40,683][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:16:41,401][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:16:42,119][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:16:42,836][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:16:43,554][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:16:44,271][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:16:44,988][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:16:45,983][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:16:46,701][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:16:47,418][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:16:48,136][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:16:48,854][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:16:49,572][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:16:50,293][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:16:51,011][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:16:51,731][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:16:52,448][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:16:53,167][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:16:53,885][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:16:54,602][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:16:55,323][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:16:56,039][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:16:56,759][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:16:57,476][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:16:58,198][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 20:16:59,640][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:16:59,645][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:16:59,648][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:17:01,027][__main__][INFO] - Iteration 366 took 55s (9.07% Gen, 88.46% Train). Generation: 5s, Training: 49s. Estimated remaining time: 9h 40m 31s. Estimated total time: 15h 32m 58s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 17s, 500 more iterations: 7h 46m 29s. [2026-03-25 20:17:01,031][__main__][INFO] - Starting iteration 366. [2026-03-25 20:17:01,035][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:17:01,035][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:17:07,985][__main__][INFO] - Number of regex retries in iteration 366: 0 [2026-03-25 20:17:07,986][__main__][INFO] - agents played in iteration 366 are Bob, Alice [2026-03-25 20:17:08,480][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:17:08,544][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:17:08,545][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:17:08,546][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:17:09,235][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:17:09,881][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:17:10,599][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:17:11,314][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:17:12,030][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:17:12,747][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:17:13,460][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:17:14,176][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:17:14,890][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:17:15,606][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:17:16,320][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:17:17,035][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:17:17,750][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:17:18,466][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:17:19,180][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:17:19,896][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:17:20,613][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:17:21,328][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:17:22,045][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:17:22,759][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:17:23,477][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:17:24,192][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:17:24,909][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:17:25,623][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:17:26,339][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:17:27,056][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:17:27,772][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:17:28,489][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:17:29,206][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:17:29,923][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:17:30,638][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:17:31,356][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:17:32,072][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:17:32,790][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:17:33,505][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:17:34,223][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:17:34,938][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:17:35,656][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:17:36,372][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:17:37,090][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:17:37,807][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:17:38,524][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:17:39,242][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:17:39,959][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:17:40,675][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:17:41,392][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:17:42,108][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:17:42,825][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:17:43,771][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:17:44,488][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:17:45,205][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:17:45,922][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:17:46,638][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:17:47,356][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:17:48,073][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:17:48,790][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:17:49,506][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:17:50,225][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:17:50,942][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:17:51,660][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:17:52,377][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:17:53,095][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:17:53,812][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:17:54,530][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:17:55,249][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:17:55,980][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 20:17:57,089][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:17:57,092][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:17:57,093][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:17:58,580][__main__][INFO] - Iteration 367 took 57s (12.08% Gen, 85.33% Train). Generation: 6s, Training: 49s. Estimated remaining time: 10h 5m 43s. Estimated total time: 15h 59m 8s. Time estimates for 10 more iterations: 9m 35s, 100 more iterations: 1h 35m 54s, 500 more iterations: 7h 59m 34s. [2026-03-25 20:17:58,583][__main__][INFO] - Starting iteration 367. [2026-03-25 20:17:58,587][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:17:58,588][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:18:08,153][__main__][INFO] - Number of regex retries in iteration 367: 0 [2026-03-25 20:18:08,154][__main__][INFO] - agents played in iteration 367 are Bob, Alice [2026-03-25 20:18:08,649][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:18:08,715][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:18:08,715][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:18:08,716][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:18:09,414][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:18:10,057][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:18:10,773][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:18:11,487][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:18:12,201][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:18:12,914][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:18:13,629][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:18:14,342][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:18:15,055][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:18:15,770][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:18:16,485][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:18:17,197][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:18:17,914][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:18:18,629][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:18:19,345][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:18:20,060][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:18:20,775][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:18:21,493][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:18:22,210][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:18:22,926][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:18:23,644][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:18:24,358][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:18:25,075][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:18:25,789][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:18:26,506][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:18:27,221][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:18:27,937][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:18:28,651][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:18:29,368][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:18:30,083][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:18:30,798][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:18:31,514][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:18:32,230][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:18:32,947][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:18:33,662][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:18:34,379][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:18:35,094][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:18:35,810][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:18:36,525][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:18:37,241][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:18:37,956][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:18:38,673][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:18:39,388][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:18:40,104][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:18:40,819][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:18:41,535][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:18:42,252][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:18:42,968][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:18:43,932][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:18:44,649][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:18:45,365][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:18:46,079][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:18:46,797][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:18:47,512][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:18:48,229][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:18:48,945][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:18:49,661][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:18:50,377][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:18:51,093][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:18:51,809][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:18:52,525][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:18:53,241][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:18:53,959][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:18:54,678][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:18:55,393][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:18:56,185][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 20:18:57,456][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:18:57,460][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:18:57,462][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:18:58,821][__main__][INFO] - Iteration 368 took 1m 0s (15.88% Gen, 81.86% Train). Generation: 9s, Training: 49s. Estimated remaining time: 10h 49m 32s. Estimated total time: 16h 43m 56s. Time estimates for 10 more iterations: 10m 2s, 100 more iterations: 1h 40m 23s, 500 more iterations: 8h 21m 58s. [2026-03-25 20:18:58,825][__main__][INFO] - Starting iteration 368. [2026-03-25 20:18:58,831][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:18:58,832][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:19:04,165][__main__][INFO] - Number of regex retries in iteration 368: 0 [2026-03-25 20:19:04,166][__main__][INFO] - agents played in iteration 368 are Bob, Alice [2026-03-25 20:19:04,655][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:19:04,722][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:19:04,723][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:19:04,723][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:19:05,411][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:19:06,055][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:19:06,774][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:19:07,488][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:19:08,205][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:19:08,919][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:19:09,635][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:19:10,350][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:19:11,064][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:19:11,780][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:19:12,496][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:19:13,213][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:19:13,929][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:19:14,646][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:19:15,362][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:19:16,079][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:19:16,795][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:19:17,509][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:19:18,225][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:19:18,940][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:19:19,654][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:19:20,370][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:19:21,086][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:19:21,802][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:19:22,516][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:19:23,231][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:19:23,947][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:19:24,663][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:19:25,379][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:19:26,095][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:19:26,812][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:19:27,528][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:19:28,245][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:19:28,960][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:19:29,677][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:19:30,394][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:19:31,111][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:19:31,826][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:19:32,545][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:19:33,262][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:19:33,978][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:19:34,694][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:19:35,411][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:19:36,128][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:19:36,844][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:19:37,562][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:19:38,278][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:19:38,996][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:19:39,999][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:19:40,717][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:19:41,434][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:19:42,150][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:19:42,867][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:19:43,583][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:19:44,300][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:19:45,017][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:19:45,735][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:19:46,451][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:19:47,169][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:19:47,887][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:19:48,602][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:19:49,320][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:19:50,038][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:19:50,755][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:19:51,472][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:19:52,203][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 20:19:53,657][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:19:53,662][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:19:53,664][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:19:55,039][__main__][INFO] - Iteration 369 took 56s (9.49% Gen, 88.06% Train). Generation: 5s, Training: 49s. Estimated remaining time: 9h 41m 30s. Estimated total time: 15h 36m 51s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 41s, 500 more iterations: 7h 48m 25s. [2026-03-25 20:19:55,043][__main__][INFO] - Starting iteration 369. [2026-03-25 20:19:55,079][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:19:55,080][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:20:00,340][__main__][INFO] - Number of regex retries in iteration 369: 0 [2026-03-25 20:20:00,341][__main__][INFO] - agents played in iteration 369 are Bob, Alice [2026-03-25 20:20:00,959][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:20:01,025][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:20:01,025][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:20:01,026][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:20:01,712][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:20:02,359][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:20:03,077][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:20:03,790][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:20:04,506][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:20:05,220][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:20:05,936][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:20:06,653][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:20:07,369][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:20:08,085][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:20:08,802][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:20:09,521][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:20:10,236][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:20:10,954][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:20:11,669][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:20:12,390][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:20:13,105][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:20:13,822][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:20:14,537][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:20:15,254][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:20:15,969][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:20:16,686][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:20:17,401][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:20:18,116][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:20:18,831][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:20:19,547][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:20:20,263][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:20:20,978][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:20:21,697][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:20:22,413][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:20:23,130][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:20:23,846][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:20:24,564][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:20:25,278][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:20:25,994][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:20:26,711][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:20:27,428][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:20:28,143][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:20:28,859][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:20:29,576][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:20:30,292][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:20:31,008][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:20:31,727][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:20:32,441][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:20:33,160][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:20:33,876][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:20:34,593][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:20:35,310][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:20:36,262][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:20:36,979][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:20:37,695][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:20:38,414][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:20:39,220][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:20:39,937][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:20:40,656][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:20:41,374][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:20:42,089][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:20:42,808][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:20:43,526][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:20:44,241][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:20:44,959][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:20:45,675][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:20:46,392][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:20:47,110][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:20:47,829][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:20:48,591][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 20:20:49,983][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:20:49,988][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:20:49,990][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:20:51,421][__main__][INFO] - Iteration 370 took 56s (9.34% Gen, 88.12% Train). Generation: 5s, Training: 49s. Estimated remaining time: 9h 42m 47s. Estimated total time: 15h 39m 4s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 54s, 500 more iterations: 7h 49m 32s. [2026-03-25 20:20:51,425][__main__][INFO] - Starting iteration 370. [2026-03-25 20:20:51,429][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:20:51,429][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:20:59,676][__main__][INFO] - Number of regex retries in iteration 370: 0 [2026-03-25 20:20:59,677][__main__][INFO] - agents played in iteration 370 are Bob, Alice [2026-03-25 20:21:00,402][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:21:00,469][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:21:00,469][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:21:00,470][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:21:01,170][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:21:01,815][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:21:02,531][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:21:03,245][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:21:03,958][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:21:04,674][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:21:05,388][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:21:06,104][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:21:06,817][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:21:07,530][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:21:08,246][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:21:08,960][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:21:09,677][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:21:10,392][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:21:11,108][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:21:11,822][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:21:12,538][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:21:13,253][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:21:13,968][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:21:14,684][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:21:15,398][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:21:16,116][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:21:16,831][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:21:17,548][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:21:18,262][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:21:18,978][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:21:19,694][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:21:20,411][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:21:21,128][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:21:21,845][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:21:22,560][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:21:23,279][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:21:23,995][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:21:24,710][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:21:25,427][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:21:26,142][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:21:26,859][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:21:27,573][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:21:28,289][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:21:29,005][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:21:29,719][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:21:30,437][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:21:31,151][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:21:31,869][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:21:32,584][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:21:33,300][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:21:34,016][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:21:34,734][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:21:35,703][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:21:36,420][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:21:37,137][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:21:37,851][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:21:38,568][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:21:39,284][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:21:40,002][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:21:40,717][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:21:41,434][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:21:42,150][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:21:42,866][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:21:43,583][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:21:44,300][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:21:45,016][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:21:45,733][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:21:46,450][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:21:47,168][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:21:47,945][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 20:21:49,227][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:21:49,231][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:21:49,233][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:21:50,799][__main__][INFO] - Iteration 371 took 59s (13.89% Gen, 83.47% Train). Generation: 8s, Training: 49s. Estimated remaining time: 10h 32m 16s. Estimated total time: 16h 29m 32s. Time estimates for 10 more iterations: 9m 53s, 100 more iterations: 1h 38m 57s, 500 more iterations: 8h 14m 46s. [2026-03-25 20:21:50,802][__main__][INFO] - Starting iteration 371. [2026-03-25 20:21:50,806][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:21:50,806][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:21:56,061][__main__][INFO] - Number of regex retries in iteration 371: 0 [2026-03-25 20:21:56,062][__main__][INFO] - agents played in iteration 371 are Bob, Alice [2026-03-25 20:21:56,562][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:21:56,627][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:21:56,628][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:21:56,629][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:21:57,322][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:21:57,974][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:21:58,689][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:21:59,405][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:22:00,119][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:22:00,834][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:22:01,549][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:22:02,264][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:22:02,980][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:22:03,696][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:22:04,413][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:22:05,127][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:22:05,843][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:22:06,559][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:22:07,275][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:22:07,991][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:22:08,709][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:22:09,427][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:22:10,144][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:22:10,859][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:22:11,576][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:22:12,293][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:22:13,009][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:22:13,726][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:22:14,443][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:22:15,158][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:22:15,876][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:22:16,592][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:22:17,308][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:22:18,024][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:22:18,740][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:22:19,456][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:22:20,173][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:22:20,888][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:22:21,606][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:22:22,321][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:22:23,038][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:22:23,754][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:22:24,472][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:22:25,186][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:22:25,904][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:22:26,620][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:22:27,338][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:22:28,054][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:22:28,770][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:22:29,486][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:22:30,204][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:22:30,920][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:22:31,918][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:22:32,633][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:22:33,350][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:22:34,067][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:22:34,784][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:22:35,499][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:22:36,218][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:22:36,934][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:22:37,650][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:22:38,392][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:22:39,122][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:22:39,840][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:22:40,556][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:22:41,275][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:22:41,992][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:22:42,709][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:22:43,426][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:22:44,180][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 20:22:45,317][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:22:45,320][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:22:45,321][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:22:46,635][__main__][INFO] - Iteration 372 took 55s (9.41% Gen, 88.23% Train). Generation: 5s, Training: 49s. Estimated remaining time: 9h 32m 19s. Estimated total time: 15h 30m 31s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 3s, 500 more iterations: 7h 45m 15s. [2026-03-25 20:22:46,646][__main__][INFO] - Starting iteration 372. [2026-03-25 20:22:46,659][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:22:46,659][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:22:52,181][__main__][INFO] - Number of regex retries in iteration 372: 0 [2026-03-25 20:22:52,182][__main__][INFO] - agents played in iteration 372 are Bob, Alice [2026-03-25 20:22:52,681][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:22:52,746][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:22:52,747][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:22:52,748][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:22:53,450][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:22:54,095][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:22:54,812][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:22:55,527][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:22:56,242][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:22:56,955][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:22:57,671][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:22:58,387][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:22:59,103][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:22:59,818][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:23:00,535][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:23:01,250][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:23:01,968][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:23:02,684][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:23:03,400][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:23:04,115][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:23:04,833][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:23:05,548][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:23:06,265][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:23:06,981][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:23:07,698][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:23:08,413][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:23:09,130][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:23:09,845][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:23:10,562][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:23:11,276][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:23:11,992][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:23:12,707][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:23:13,424][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:23:14,138][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:23:14,856][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:23:15,571][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:23:16,287][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:23:17,002][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:23:17,719][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:23:18,434][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:23:19,150][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:23:19,866][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:23:20,583][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:23:21,299][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:23:22,016][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:23:22,732][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:23:23,449][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:23:24,164][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:23:24,880][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:23:25,596][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:23:26,312][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:23:27,028][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:23:27,970][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:23:28,689][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:23:29,405][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:23:30,121][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:23:30,837][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:23:31,555][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:23:32,271][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:23:32,987][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:23:33,704][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:23:34,421][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:23:35,136][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:23:35,854][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:23:36,571][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:23:37,288][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:23:38,005][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:23:38,723][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:23:39,440][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:23:40,193][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 20:23:41,767][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:23:41,772][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:23:41,774][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:23:43,219][__main__][INFO] - Iteration 373 took 56s (9.76% Gen, 87.68% Train). Generation: 5s, Training: 49s. Estimated remaining time: 9h 43m 33s. Estimated total time: 15h 42m 42s. Time estimates for 10 more iterations: 9m 25s, 100 more iterations: 1h 34m 16s, 500 more iterations: 7h 51m 21s. [2026-03-25 20:23:43,221][__main__][INFO] - Starting iteration 373. [2026-03-25 20:23:43,226][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:23:43,226][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:23:48,493][__main__][INFO] - Number of regex retries in iteration 373: 0 [2026-03-25 20:23:48,494][__main__][INFO] - agents played in iteration 373 are Bob, Alice [2026-03-25 20:23:48,987][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:23:49,051][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:23:49,052][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:23:49,052][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:23:49,742][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:23:50,387][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:23:51,106][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:23:51,820][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:23:52,536][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:23:53,251][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:23:53,967][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:23:54,683][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:23:55,399][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:23:56,115][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:23:56,830][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:23:57,545][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:23:58,261][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:23:58,976][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:23:59,692][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:24:00,408][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:24:01,122][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:24:01,838][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:24:02,552][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:24:03,271][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:24:03,985][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:24:04,703][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:24:05,417][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:24:06,135][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:24:06,850][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:24:07,566][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:24:08,281][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:24:08,999][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:24:09,713][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:24:10,432][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:24:11,149][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:24:11,864][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:24:12,582][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:24:13,296][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:24:14,012][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:24:14,728][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:24:15,444][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:24:16,160][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:24:16,877][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:24:17,592][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:24:18,310][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:24:19,025][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:24:19,743][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:24:20,460][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:24:21,177][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:24:21,895][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:24:22,611][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:24:23,327][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:24:24,291][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:24:25,009][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:24:25,726][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:24:26,443][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:24:27,158][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:24:27,876][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:24:28,592][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:24:29,310][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:24:30,026][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:24:30,743][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:24:31,460][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:24:32,177][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:24:32,893][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:24:33,611][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:24:34,330][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:24:35,046][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:24:35,765][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:24:36,533][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 20:24:37,834][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:24:37,839][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:24:37,841][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:24:39,279][__main__][INFO] - Iteration 374 took 56s (9.40% Gen, 88.03% Train). Generation: 5s, Training: 49s. Estimated remaining time: 9h 34m 10s. Estimated total time: 15h 34m 15s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 25s, 500 more iterations: 7h 47m 7s. [2026-03-25 20:24:39,282][__main__][INFO] - Starting iteration 374. [2026-03-25 20:24:39,288][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:24:39,290][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:24:44,527][__main__][INFO] - Number of regex retries in iteration 374: 0 [2026-03-25 20:24:44,528][__main__][INFO] - agents played in iteration 374 are Bob, Alice [2026-03-25 20:24:45,025][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:24:45,092][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:24:45,093][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:24:45,094][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:24:45,780][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:24:46,425][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:24:47,142][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:24:47,858][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:24:48,572][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:24:52,776][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:24:53,603][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:24:54,317][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:24:55,032][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:24:55,748][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:24:56,462][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:24:57,177][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:24:57,891][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:24:58,604][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:24:59,320][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:25:00,035][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:25:00,749][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:25:01,465][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:25:02,180][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:25:02,897][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:25:03,610][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:25:04,326][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:25:05,040][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:25:05,755][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:25:06,472][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:25:07,188][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:25:07,904][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:25:08,619][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:25:09,337][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:25:10,055][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:25:10,769][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:25:11,483][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:25:12,199][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:25:12,914][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:25:13,630][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:25:14,346][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:25:15,061][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:25:15,775][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:25:16,493][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:25:17,208][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:25:17,925][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:25:18,640][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:25:19,357][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:25:20,072][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:25:20,790][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:25:21,504][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:25:22,221][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:25:22,939][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:25:23,938][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:25:24,652][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:25:25,370][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:25:26,084][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:25:26,802][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:25:27,517][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:25:28,234][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:25:28,948][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:25:29,666][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:25:30,381][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:25:31,099][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:25:31,814][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:25:32,529][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:25:33,246][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:25:33,961][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:25:34,678][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:25:35,394][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:25:36,124][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:50 [2026-03-25 20:25:37,219][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:25:37,223][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:25:37,224][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:25:38,705][__main__][INFO] - Iteration 375 took 59s (8.82% Gen, 88.68% Train). Generation: 5s, Training: 52s. Estimated remaining time: 10h 29m 15s. Estimated total time: 16h 30m 20s. Time estimates for 10 more iterations: 9m 54s, 100 more iterations: 1h 39m 2s, 500 more iterations: 8h 15m 10s. [2026-03-25 20:25:38,707][__main__][INFO] - Starting iteration 375. [2026-03-25 20:25:38,712][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:25:38,713][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:25:44,442][__main__][INFO] - Number of regex retries in iteration 375: 0 [2026-03-25 20:25:44,443][__main__][INFO] - agents played in iteration 375 are Bob, Alice [2026-03-25 20:25:44,958][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:25:45,024][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:25:45,025][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:25:45,025][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:25:45,718][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:25:46,363][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:25:47,078][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:25:47,794][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:25:48,509][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:25:49,223][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:25:49,940][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:25:50,653][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:25:51,369][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:25:52,083][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:25:52,798][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:25:53,516][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:25:54,230][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:25:54,947][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:25:55,662][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:25:56,378][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:25:57,094][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:25:57,810][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:25:58,528][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:25:59,242][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:25:59,959][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:26:00,674][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:26:01,392][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:26:02,109][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:26:02,825][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:26:03,541][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:26:04,255][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:26:04,970][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:26:05,686][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:26:06,402][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:26:07,118][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:26:07,833][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:26:08,574][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:26:09,291][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:26:10,008][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:26:10,723][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:26:11,441][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:26:12,156][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:26:12,873][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:26:13,588][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:26:14,305][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:26:15,021][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:26:15,738][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:26:16,453][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:26:17,170][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:26:17,887][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:26:18,604][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:26:19,319][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:26:20,265][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:26:20,982][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:26:21,700][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:26:22,416][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:26:23,134][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:26:23,850][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:26:24,568][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:26:25,284][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:26:26,000][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:26:26,717][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:26:27,434][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:26:28,151][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:26:28,868][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:26:29,585][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:26:30,301][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:26:31,020][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:26:31,737][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:26:32,463][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 20:26:33,893][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:26:33,898][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:26:33,900][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:26:35,362][__main__][INFO] - Iteration 376 took 56s (10.12% Gen, 87.30% Train). Generation: 5s, Training: 49s. Estimated remaining time: 9h 42m 11s. Estimated total time: 15h 44m 12s. Time estimates for 10 more iterations: 9m 26s, 100 more iterations: 1h 34m 25s, 500 more iterations: 7h 52m 6s. [2026-03-25 20:26:35,365][__main__][INFO] - Starting iteration 376. [2026-03-25 20:26:35,369][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:26:35,370][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:26:42,008][__main__][INFO] - Number of regex retries in iteration 376: 0 [2026-03-25 20:26:42,009][__main__][INFO] - agents played in iteration 376 are Bob, Alice [2026-03-25 20:26:42,600][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:26:42,665][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:26:42,666][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:26:42,667][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:26:43,350][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:26:43,995][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:26:44,710][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:26:45,428][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:26:46,144][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:26:46,859][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:26:47,572][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:26:48,289][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:26:49,003][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:26:49,720][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:26:50,435][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:26:51,150][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:26:51,865][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:26:52,580][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:26:53,297][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:26:54,013][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:26:54,729][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:26:55,444][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:26:56,161][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:26:56,876][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:26:57,593][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:26:58,311][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:26:59,028][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:26:59,743][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:27:00,459][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:27:01,177][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:27:01,893][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:27:02,606][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:27:03,323][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:27:04,040][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:27:04,756][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:27:05,471][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:27:06,188][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:27:06,905][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:27:07,619][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:27:08,335][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:27:09,052][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:27:09,768][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:27:10,485][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:27:11,201][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:27:11,917][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:27:12,634][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:27:19,591][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:27:20,306][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:27:21,023][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:27:21,735][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:27:22,448][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:27:23,165][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:27:24,141][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:27:24,860][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:27:25,573][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:27:26,289][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:27:27,003][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:27:27,718][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:27:28,435][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:27:29,150][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:27:29,866][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:27:30,581][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:27:31,297][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:27:32,012][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:27:32,727][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:27:33,443][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:27:34,159][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:27:34,874][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:27:35,593][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:27:36,362][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:53 [2026-03-25 20:27:37,364][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:27:37,369][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:27:37,371][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:27:38,737][__main__][INFO] - Iteration 377 took 1m 3s (10.48% Gen, 87.36% Train). Generation: 6s, Training: 55s. Estimated remaining time: 11h 33m 6s. Estimated total time: 17h 36m 10s. Time estimates for 10 more iterations: 10m 33s, 100 more iterations: 1h 45m 37s, 500 more iterations: 8h 48m 5s. [2026-03-25 20:27:43,502][__main__][INFO] - Starting iteration 377. [2026-03-25 20:27:43,523][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:27:43,524][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:27:48,803][__main__][INFO] - Number of regex retries in iteration 377: 0 [2026-03-25 20:27:48,804][__main__][INFO] - agents played in iteration 377 are Bob, Alice [2026-03-25 20:27:49,317][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:27:49,384][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:27:49,384][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:27:49,385][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:27:50,076][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:27:50,719][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:27:51,433][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:27:52,144][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:27:52,857][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:27:53,569][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:27:54,281][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:27:54,993][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:27:55,706][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:27:56,419][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:27:57,131][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:27:57,844][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:27:58,557][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:27:59,270][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:27:59,982][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:28:00,697][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:28:01,409][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:28:02,125][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:28:02,839][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:28:03,552][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:28:04,267][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:28:04,981][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:28:05,696][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:28:06,411][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:28:07,126][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:28:07,840][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:28:08,555][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:28:09,271][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:28:09,986][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:28:10,700][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:28:11,417][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:28:12,133][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:28:12,848][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:28:13,561][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:28:14,277][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:28:14,993][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:28:15,707][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:28:16,422][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:28:17,138][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:28:17,852][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:28:18,567][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:28:19,284][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:28:19,999][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:28:20,718][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:28:21,431][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:28:22,150][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:28:22,865][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:28:23,584][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:28:24,578][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:28:25,294][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:28:26,009][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:28:26,726][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:28:27,442][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:28:28,157][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:28:28,872][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:28:29,588][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:28:30,306][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:28:31,021][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:28:31,739][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:28:32,455][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:28:33,171][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:28:33,887][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:28:34,602][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:28:35,317][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:28:36,033][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:28:36,759][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 20:28:37,912][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:28:37,918][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:28:37,921][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:28:39,356][__main__][INFO] - Iteration 378 took 55s (9.45% Gen, 87.96% Train). Generation: 5s, Training: 49s. Estimated remaining time: 9h 26m 33s. Estimated total time: 15h 30m 38s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 3s, 500 more iterations: 7h 45m 19s. [2026-03-25 20:28:39,358][__main__][INFO] - Starting iteration 378. [2026-03-25 20:28:39,362][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:28:39,362][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:28:40,355][mllm.models.large_language_model_local][WARNING] - Response ="" did not match regex: (|), retry 1/1 [2026-03-25 20:28:46,852][__main__][INFO] - Number of regex retries in iteration 378: 1 [2026-03-25 20:28:46,853][__main__][INFO] - agents played in iteration 378 are Bob, Alice [2026-03-25 20:28:47,348][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:28:47,413][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:28:47,414][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:28:47,414][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2026-03-25 20:28:48,102][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:28:48,745][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:28:49,462][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:28:50,175][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:28:50,890][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:28:51,606][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:28:52,318][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:28:53,034][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:28:53,747][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:28:54,461][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:28:55,177][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 20:28:55,890][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:28:56,606][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:28:57,321][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:28:58,036][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:28:58,750][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:28:59,467][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:29:00,180][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:29:00,894][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:29:01,612][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:29:02,325][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 256 [2026-03-25 20:29:03,042][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:29:03,756][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:29:04,473][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:29:05,187][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:29:05,905][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:29:06,620][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:29:07,337][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:29:08,052][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:29:08,771][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:29:09,486][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:29:10,204][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 20:29:10,919][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:29:11,637][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:29:12,351][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:29:13,068][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:29:13,783][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:29:14,499][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:29:15,216][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:29:15,931][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:29:16,649][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 
20:29:17,363][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:29:18,080][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:29:18,794][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:29:19,511][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:29:20,227][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:29:20,942][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:29:21,659][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:29:22,600][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:29:23,317][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:29:24,034][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:29:24,750][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 20:29:25,464][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:29:26,182][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:29:26,897][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:29:27,614][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:29:28,330][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:29:29,047][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:29:29,762][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:29:30,477][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:29:31,194][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:29:31,910][mllm.training.trainer_common][INFO] - 
Processing mini-batch 244 of 256 [2026-03-25 20:29:32,626][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:29:33,342][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:29:34,061][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 20:29:34,792][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 20:29:35,927][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:29:35,931][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:29:35,933][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:29:37,570][__main__][INFO] - Iteration 379 took 58s (12.87% Gen, 84.31% Train). Generation: 7s, Training: 49s. Estimated remaining time: 10h 5m 7s. Estimated total time: 16h 10m 10s. Time estimates for 10 more iterations: 9m 42s, 100 more iterations: 1h 37m 1s, 500 more iterations: 8h 5m 5s. [2026-03-25 20:29:37,573][__main__][INFO] - Starting iteration 379. [2026-03-25 20:29:37,578][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:29:37,578][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:29:42,972][__main__][INFO] - Number of regex retries in iteration 379: 0 [2026-03-25 20:29:42,973][__main__][INFO] - agents played in iteration 379 are Bob, Alice [2026-03-25 20:29:43,464][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:29:43,529][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:29:43,529][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:29:43,530][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:29:44,220][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:29:44,865][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:29:45,582][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:29:46,297][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:29:47,011][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:29:47,728][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:29:48,442][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:29:49,157][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:29:49,872][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:29:50,588][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:29:51,302][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:29:52,020][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:29:52,735][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:29:53,451][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:29:54,165][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:29:54,882][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:29:55,596][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:29:56,313][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:29:57,027][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:29:57,744][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:29:58,459][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:29:59,176][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:29:59,891][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:30:00,606][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:30:01,323][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:30:02,039][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:30:02,755][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:30:03,472][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:30:04,187][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:30:04,902][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:30:05,618][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:30:06,333][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:30:07,049][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:30:07,764][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:30:08,480][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:30:09,196][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:30:09,914][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:30:10,628][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:30:11,346][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:30:12,061][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:30:12,779][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:30:13,493][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:30:14,209][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:30:14,926][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:30:15,642][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:30:16,358][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:30:17,076][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:30:17,792][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:30:18,739][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:30:19,458][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:30:20,173][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:30:20,889][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:30:21,606][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:30:22,325][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:30:23,040][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:30:23,758][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:30:24,473][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:30:25,192][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:30:25,907][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:30:26,624][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:30:27,342][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:30:28,059][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:30:28,776][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:30:29,492][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:30:30,211][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:30:30,945][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 20:30:31,905][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:30:31,907][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:30:31,908][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:30:33,330][__main__][INFO] - Iteration 380 took 55s (9.68% Gen, 87.77% Train). Generation: 5s, Training: 48s. Estimated remaining time: 9h 23m 15s. Estimated total time: 15h 29m 14s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 55s, 500 more iterations: 7h 44m 37s. [2026-03-25 20:30:33,333][__main__][INFO] - Starting iteration 380. [2026-03-25 20:30:33,337][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:30:33,338][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:30:38,856][__main__][INFO] - Number of regex retries in iteration 380: 0 [2026-03-25 20:30:38,857][__main__][INFO] - agents played in iteration 380 are Bob, Alice [2026-03-25 20:30:39,357][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:30:39,423][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:30:39,424][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:30:39,425][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:30:40,154][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:30:40,804][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:30:41,522][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:30:42,238][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:30:42,952][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:30:43,668][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:30:44,385][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:30:45,103][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:30:45,818][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:30:46,536][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:30:47,252][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:30:47,971][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:30:48,688][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:30:49,408][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:30:50,127][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:30:50,842][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:30:51,562][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:30:52,278][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:30:52,996][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:30:53,713][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:30:54,428][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:30:55,146][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:30:55,863][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:30:56,580][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:30:57,297][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:30:58,014][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:30:58,732][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:30:59,449][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:31:00,166][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:31:00,883][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:31:01,600][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:31:02,316][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:31:03,034][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:31:03,748][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:31:04,466][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:31:05,181][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:31:05,900][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:31:06,616][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:31:07,332][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:31:08,049][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:31:08,766][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:31:09,482][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:31:10,198][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:31:10,914][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:31:11,633][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:31:12,349][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:31:13,066][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:31:13,783][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:31:14,802][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:31:15,520][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:31:16,236][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:31:16,952][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:31:17,670][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:31:18,389][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:31:19,104][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:31:19,822][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:31:20,539][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:31:21,255][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:31:21,972][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:31:22,690][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:31:23,406][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:31:24,123][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:31:24,840][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:31:25,558][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:31:26,274][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:31:27,005][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 20:31:28,212][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:31:28,216][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:31:28,218][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:31:29,563][__main__][INFO] - Iteration 381 took 56s (9.82% Gen, 87.79% Train). Generation: 5s, Training: 49s. Estimated remaining time: 9h 30m 12s. Estimated total time: 15h 37m 7s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 42s, 500 more iterations: 7h 48m 33s. [2026-03-25 20:31:29,565][__main__][INFO] - Starting iteration 381. [2026-03-25 20:31:29,569][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:31:29,570][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:31:35,312][__main__][INFO] - Number of regex retries in iteration 381: 0 [2026-03-25 20:31:35,314][__main__][INFO] - agents played in iteration 381 are Bob, Alice [2026-03-25 20:31:35,814][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:31:35,881][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:31:35,882][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:31:35,883][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:31:36,571][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:31:37,216][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:31:37,935][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:31:38,650][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:31:39,365][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:31:40,083][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:31:40,797][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:31:41,514][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:31:42,230][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:31:42,944][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:31:43,661][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:31:44,378][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:31:45,095][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:31:45,810][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:31:46,527][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:31:47,243][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:31:47,959][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:31:48,674][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:31:49,390][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:31:50,105][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:31:50,821][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:31:51,537][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:31:52,253][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:31:52,968][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:31:53,685][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:31:54,402][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:31:55,118][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:31:55,832][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:31:56,550][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:31:57,265][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:31:57,982][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:31:58,700][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:31:59,416][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:32:00,133][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:32:00,850][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:32:01,567][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:32:02,283][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:32:02,999][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:32:03,716][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:32:04,432][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:32:05,149][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:32:05,865][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:32:06,581][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:32:07,298][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:32:08,014][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:32:08,732][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:32:09,448][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:32:10,166][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:32:11,116][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:32:11,833][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:32:12,548][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:32:13,264][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:32:13,981][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:32:14,698][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:32:15,414][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:32:16,131][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:32:16,847][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:32:17,564][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:32:18,280][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:32:18,999][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:32:19,716][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:32:20,435][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:32:21,151][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:32:21,869][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:32:22,587][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:32:23,311][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 20:32:24,488][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:32:24,492][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:32:24,494][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:32:25,927][__main__][INFO] - Iteration 382 took 56s (10.19% Gen, 87.26% Train). Generation: 5s, Training: 49s. Estimated remaining time: 9h 31m 28s. Estimated total time: 15h 39m 20s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 56s, 500 more iterations: 7h 49m 40s. [2026-03-25 20:32:25,931][__main__][INFO] - Starting iteration 382. [2026-03-25 20:32:25,937][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:32:25,938][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:32:31,387][__main__][INFO] - Number of regex retries in iteration 382: 0 [2026-03-25 20:32:31,389][__main__][INFO] - agents played in iteration 382 are Bob, Alice [2026-03-25 20:32:31,910][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:32:31,977][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:32:31,978][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:32:31,979][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:32:32,665][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:32:33,313][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:32:34,029][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:32:34,746][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:32:35,461][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:32:36,176][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:32:36,894][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:32:37,610][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:32:38,328][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:32:39,044][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:32:39,760][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:32:40,474][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:32:41,192][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:32:41,906][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:32:42,624][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:32:43,339][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:32:44,055][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:32:44,771][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:32:45,490][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:32:46,207][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:32:46,923][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:32:47,640][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:32:48,356][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:32:49,073][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:32:49,788][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:32:50,505][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:32:51,221][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:32:51,937][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:32:52,654][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:32:53,371][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:32:54,088][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:32:54,803][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:32:55,519][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:32:56,237][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:32:56,952][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:32:57,668][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:32:58,385][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:32:59,103][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:32:59,821][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:33:00,539][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:33:01,255][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:33:01,974][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:33:02,694][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:33:03,412][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:33:04,130][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:33:04,846][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:33:05,562][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:33:06,280][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:33:07,229][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:33:07,945][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:33:08,664][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:33:09,382][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:33:10,099][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:33:10,815][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:33:11,532][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:33:12,249][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:33:12,966][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:33:13,684][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:33:14,401][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:33:15,119][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:33:15,836][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:33:16,554][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:33:17,271][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:33:17,989][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:33:18,707][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:33:19,473][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 20:33:20,567][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:33:20,570][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:33:20,572][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:33:21,885][__main__][INFO] - Iteration 383 took 55s (9.74% Gen, 87.90% Train). Generation: 5s, Training: 49s. Estimated remaining time: 9h 23m 43s. Estimated total time: 15h 32m 30s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 15s, 500 more iterations: 7h 46m 15s. [2026-03-25 20:33:21,888][__main__][INFO] - Starting iteration 383. [2026-03-25 20:33:21,891][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:33:21,892][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:33:30,340][__main__][INFO] - Number of regex retries in iteration 383: 0 [2026-03-25 20:33:30,341][__main__][INFO] - agents played in iteration 383 are Bob, Alice [2026-03-25 20:33:30,935][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:33:31,001][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:33:31,002][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:33:31,003][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:33:31,701][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:33:32,346][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:33:33,063][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:33:33,779][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:33:34,495][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:33:35,212][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:33:35,928][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:33:36,645][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:33:37,361][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:33:38,078][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:33:38,794][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:33:39,510][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:33:40,226][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:33:40,942][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:33:41,660][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:33:42,377][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:33:43,094][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:33:43,810][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:33:44,527][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:33:45,244][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:33:45,962][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:33:46,679][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:33:47,396][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:33:48,114][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:33:48,834][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:33:49,552][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:33:50,271][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:33:50,987][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:33:51,705][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:33:52,423][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:33:53,138][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:33:53,858][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:33:54,574][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:33:55,293][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:33:56,010][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:33:56,729][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:33:57,444][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:33:58,163][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:33:58,880][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:33:59,595][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:34:00,311][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:34:01,028][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:34:01,745][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:34:02,464][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:34:03,181][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:34:03,897][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:34:04,615][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:34:05,331][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:34:06,383][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:34:07,099][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:34:07,817][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:34:08,533][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:34:09,252][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:34:09,971][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:34:10,690][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:34:11,409][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:34:12,126][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:34:12,845][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:34:13,561][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:34:14,279][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:34:14,996][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:34:15,713][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:34:16,431][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:34:17,149][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:34:17,865][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:34:18,589][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 20:34:19,795][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:34:19,799][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:34:19,801][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:34:21,104][__main__][INFO] - Iteration 384 took 59s (14.27% Gen, 83.53% Train). Generation: 8s, Training: 49s. Estimated remaining time: 10h 17m 7s. Estimated total time: 16h 26m 54s. Time estimates for 10 more iterations: 9m 52s, 100 more iterations: 1h 38m 41s, 500 more iterations: 8h 13m 27s. [2026-03-25 20:34:21,107][__main__][INFO] - Starting iteration 384. [2026-03-25 20:34:21,112][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:34:21,113][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:34:26,659][__main__][INFO] - Number of regex retries in iteration 384: 0 [2026-03-25 20:34:26,660][__main__][INFO] - agents played in iteration 384 are Bob, Alice [2026-03-25 20:34:27,202][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:34:27,268][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:34:27,269][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:34:27,269][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:34:27,978][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:34:28,623][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:34:29,342][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:34:30,058][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:34:30,774][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:34:31,492][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:34:32,207][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:34:32,923][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:34:33,640][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:34:34,358][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:34:35,074][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:34:35,791][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:34:36,508][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:34:37,225][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:34:37,943][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:34:38,658][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:34:39,376][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:34:40,092][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:34:40,810][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:34:41,526][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:34:42,243][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:34:42,958][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:34:43,676][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:34:44,391][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:34:45,108][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:34:45,824][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:34:46,542][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:34:47,258][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:34:47,975][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:34:48,692][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:34:49,409][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:34:50,126][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:34:50,844][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:34:51,560][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:34:52,277][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:34:52,994][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:34:53,711][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:34:54,428][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:34:55,145][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:34:55,862][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:34:56,580][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:34:57,298][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:34:58,014][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:34:58,733][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:34:59,450][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:35:00,170][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:35:00,887][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:35:01,605][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:35:02,557][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:35:03,275][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:35:05,723][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:35:06,441][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:35:07,159][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:35:07,876][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:35:08,595][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:35:09,311][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:35:13,468][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:35:14,182][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:35:14,898][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:35:15,612][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:35:16,328][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:35:17,043][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:35:17,758][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:35:18,474][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:35:19,190][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:35:19,910][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:51 [2026-03-25 20:35:21,152][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:35:21,157][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:35:21,159][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:35:22,653][__main__][INFO] - Iteration 385 took 1m 1s (9.01% Gen, 88.55% Train). Generation: 5s, Training: 54s. Estimated remaining time: 10h 54m 55s. Estimated total time: 17h 5m 43s. Time estimates for 10 more iterations: 10m 15s, 100 more iterations: 1h 42m 34s, 500 more iterations: 8h 32m 51s. [2026-03-25 20:35:22,655][__main__][INFO] - Starting iteration 385. [2026-03-25 20:35:22,659][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:35:22,660][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:35:28,255][__main__][INFO] - Number of regex retries in iteration 385: 0 [2026-03-25 20:35:28,256][__main__][INFO] - agents played in iteration 385 are Bob, Alice [2026-03-25 20:35:28,753][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:35:28,819][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:35:28,819][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:35:28,820][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:35:29,505][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:35:30,152][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:35:30,866][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:35:31,580][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:35:32,292][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:35:33,006][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:35:33,719][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:35:34,432][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:35:35,148][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:35:35,860][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:35:36,574][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:35:37,289][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:35:38,002][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:35:38,717][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:35:39,431][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:35:40,145][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:35:40,860][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:35:41,575][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:35:42,290][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:35:43,004][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:35:43,717][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:35:44,434][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:35:45,148][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:35:45,865][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:35:46,579][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:35:47,296][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:35:48,011][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:35:48,727][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:35:49,442][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:35:50,158][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:35:50,875][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:35:51,589][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:35:52,306][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:35:53,022][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:35:53,739][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:35:54,454][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:35:55,170][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:35:55,886][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:35:56,601][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:35:57,318][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:35:58,032][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:35:58,750][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:35:59,465][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:36:00,182][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:36:00,899][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:36:01,616][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:36:02,330][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:36:03,047][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:36:03,991][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:36:04,706][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:36:05,422][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:36:06,140][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:36:06,856][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:36:07,575][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:36:08,289][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:36:09,007][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:36:09,722][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:36:10,440][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:36:11,157][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:36:11,873][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:36:12,590][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:36:13,305][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:36:14,022][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:36:14,739][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:36:15,457][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:36:16,207][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 20:36:17,258][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:36:17,262][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:36:17,264][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:36:20,146][__main__][INFO] - Iteration 386 took 57s (9.73% Gen, 85.25% Train). Generation: 5s, Training: 49s. Estimated remaining time: 9h 46m 22s. Estimated total time: 15h 58m 8s. Time estimates for 10 more iterations: 9m 34s, 100 more iterations: 1h 35m 48s, 500 more iterations: 7h 59m 4s. [2026-03-25 20:36:20,148][__main__][INFO] - Starting iteration 386. [2026-03-25 20:36:20,152][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:36:20,154][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:36:29,214][__main__][INFO] - Number of regex retries in iteration 386: 0 [2026-03-25 20:36:29,215][__main__][INFO] - agents played in iteration 386 are Bob, Alice [2026-03-25 20:36:29,718][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:36:29,783][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:36:29,784][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:36:29,785][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:36:30,468][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:36:31,118][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:36:31,831][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:36:32,544][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:36:33,257][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:36:33,969][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:36:34,682][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:36:35,396][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:36:36,108][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:36:36,822][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:36:37,534][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:36:38,247][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:36:38,962][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:36:39,674][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:36:40,387][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:36:41,103][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:36:41,814][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:36:42,528][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:36:43,243][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:36:43,955][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:36:44,673][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:36:45,386][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:36:46,101][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:36:46,815][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:36:47,531][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:36:48,244][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:36:48,958][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:36:49,674][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:36:50,390][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:36:51,106][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:36:51,819][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:36:52,535][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:36:53,251][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:36:53,967][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:36:54,681][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:36:55,397][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:36:56,112][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:36:56,827][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:36:57,544][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:36:58,257][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:36:58,976][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:36:59,691][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:37:00,408][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:37:01,123][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:37:01,839][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:37:02,555][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:37:03,271][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:37:03,986][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:37:05,022][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:37:05,740][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:37:06,453][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:37:07,169][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:37:07,884][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:37:08,599][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:37:09,316][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:37:10,030][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:37:10,747][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:37:11,462][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:37:12,179][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:37:12,893][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:37:13,609][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:37:14,325][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:37:15,039][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:37:15,755][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:37:16,471][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:37:17,188][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 20:37:18,286][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:37:18,290][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:37:18,292][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:37:19,569][__main__][INFO] - Iteration 387 took 59s (15.25% Gen, 82.60% Train). Generation: 9s, Training: 49s. Estimated remaining time: 10h 17m 33s. Estimated total time: 16h 30m 18s. Time estimates for 10 more iterations: 9m 54s, 100 more iterations: 1h 39m 1s, 500 more iterations: 8h 15m 9s. [2026-03-25 20:37:19,572][__main__][INFO] - Starting iteration 387. [2026-03-25 20:37:19,576][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:37:19,577][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:37:25,367][__main__][INFO] - Number of regex retries in iteration 387: 0 [2026-03-25 20:37:25,368][__main__][INFO] - agents played in iteration 387 are Bob, Alice [2026-03-25 20:37:25,860][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:37:25,924][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:37:25,925][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:37:25,926][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:37:26,606][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:37:27,251][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:37:27,967][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:37:28,682][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:37:29,396][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:37:30,108][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:37:30,825][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:37:31,539][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:37:32,255][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:37:32,969][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:37:33,685][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:37:34,399][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:37:35,113][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:37:35,830][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:37:36,544][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:37:37,260][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:37:37,974][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:37:38,692][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:37:39,408][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:37:40,125][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:37:40,840][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:37:41,558][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:37:42,272][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:37:42,990][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:37:43,704][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:37:44,418][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:37:45,133][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:37:45,847][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:37:46,564][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:37:47,278][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:37:47,994][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:37:48,709][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:37:49,426][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:37:50,139][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:37:50,855][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:37:51,570][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:37:52,285][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:37:53,000][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:37:53,717][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:37:54,431][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:37:55,149][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:37:55,864][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:37:56,579][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:37:57,295][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:37:58,012][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:37:58,727][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:37:59,443][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:38:00,159][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:38:01,106][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:38:01,822][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:38:02,537][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:38:03,252][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:38:03,969][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:38:04,684][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:38:05,400][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:38:06,116][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:38:06,834][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:38:07,549][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:38:08,266][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:38:08,982][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:38:09,700][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:38:10,416][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:38:11,133][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:38:11,848][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:38:12,566][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:38:13,300][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 20:38:14,594][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:38:14,597][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:38:14,599][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:38:16,021][__main__][INFO] - Iteration 388 took 56s (10.26% Gen, 87.22% Train). Generation: 5s, Training: 49s. Estimated remaining time: 9h 27m 4s. Estimated total time: 15h 40m 46s. Time estimates for 10 more iterations: 9m 24s, 100 more iterations: 1h 34m 4s, 500 more iterations: 7h 50m 23s. [2026-03-25 20:38:16,023][__main__][INFO] - Starting iteration 388. [2026-03-25 20:38:16,027][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:38:16,027][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:38:21,596][__main__][INFO] - Number of regex retries in iteration 388: 0 [2026-03-25 20:38:21,598][__main__][INFO] - agents played in iteration 388 are Bob, Alice [2026-03-25 20:38:22,096][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:38:22,161][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:38:22,162][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:38:22,163][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:38:22,857][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:38:23,500][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:38:24,219][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:38:24,934][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:38:25,648][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:38:26,364][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:38:27,079][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:38:27,797][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:38:28,512][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:38:29,230][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:38:29,945][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:38:30,662][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:38:31,378][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:38:32,095][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:38:32,810][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:38:33,523][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:38:34,241][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:38:34,955][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:38:35,674][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:38:36,388][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:38:37,105][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:38:37,821][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:38:38,537][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:38:39,253][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:38:39,968][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:38:40,684][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:38:41,402][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:38:42,118][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:38:42,834][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:38:43,549][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:38:44,267][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:38:44,981][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:38:45,701][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:38:46,418][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:38:47,133][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:38:47,851][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:38:48,566][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:38:49,282][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:38:50,000][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:38:50,717][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:38:51,433][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:38:52,151][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:38:52,868][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:38:53,586][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:38:54,301][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:38:55,018][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:38:55,735][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:38:56,452][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:38:57,399][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:38:58,116][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:38:58,832][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:38:59,549][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:39:00,265][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:39:00,981][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:39:01,697][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:39:02,413][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:39:03,130][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:39:03,847][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:39:04,564][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:39:05,279][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:39:05,996][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:39:06,712][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:39:07,429][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:39:08,146][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:39:08,864][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:39:09,603][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 20:39:10,636][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:39:10,640][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:39:10,641][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:39:12,021][__main__][INFO] - Iteration 389 took 55s (9.95% Gen, 87.58% Train). Generation: 5s, Training: 49s. Estimated remaining time: 9h 18m 38s. Estimated total time: 15h 33m 15s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 19s, 500 more iterations: 7h 46m 37s. [2026-03-25 20:39:12,023][__main__][INFO] - Starting iteration 389. [2026-03-25 20:39:12,027][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:39:12,027][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:39:17,251][__main__][INFO] - Number of regex retries in iteration 389: 0 [2026-03-25 20:39:17,252][__main__][INFO] - agents played in iteration 389 are Bob, Alice [2026-03-25 20:39:17,825][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:39:17,890][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:39:17,890][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:39:17,891][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:39:18,573][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:39:19,219][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:39:19,936][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:39:20,652][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:39:21,366][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:39:22,081][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:39:22,798][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:39:23,514][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:39:24,231][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:39:24,946][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:39:25,660][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:39:26,375][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:39:27,090][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:39:27,806][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:39:28,522][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:39:29,236][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:39:29,953][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:39:30,669][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:39:31,383][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:39:32,100][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:39:32,815][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:39:33,533][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:39:34,248][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:39:34,964][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:39:35,681][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:39:36,396][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:39:37,114][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:39:37,828][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:39:38,546][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:39:39,263][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:39:39,979][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:39:40,694][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:39:41,413][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:39:42,128][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:39:42,845][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:39:43,561][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:39:44,278][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:39:44,995][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:39:45,711][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:39:46,430][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:39:47,147][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:39:47,864][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:39:48,580][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:39:49,298][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:39:50,015][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:39:50,731][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:39:51,449][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:39:52,165][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:39:53,222][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:39:53,939][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:39:54,656][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:39:55,371][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:39:56,090][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:39:56,807][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:39:57,525][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:39:58,242][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:39:58,959][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:39:59,676][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:40:00,392][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:40:01,109][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:40:01,827][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:40:02,545][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:40:03,262][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:40:03,980][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:40:04,696][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:40:05,430][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 20:40:06,371][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:40:06,373][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:40:06,374][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:40:07,679][__main__][INFO] - Iteration 390 took 55s (9.39% Gen, 88.26% Train). Generation: 5s, Training: 49s. Estimated remaining time: 9h 12m 0s. Estimated total time: 15h 27m 34s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 45s, 500 more iterations: 7h 43m 47s. [2026-03-25 20:40:07,682][__main__][INFO] - Starting iteration 390. [2026-03-25 20:40:07,686][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:40:07,687][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:40:13,112][__main__][INFO] - Number of regex retries in iteration 390: 0 [2026-03-25 20:40:13,113][__main__][INFO] - agents played in iteration 390 are Bob, Alice [2026-03-25 20:40:13,654][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:40:13,720][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:40:13,721][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:40:13,722][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:40:14,405][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:40:15,051][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:40:15,770][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:40:16,485][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:40:17,202][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:40:17,918][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:40:18,636][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:40:19,351][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:40:20,070][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:40:20,785][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:40:21,504][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:40:22,218][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:40:22,936][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:40:23,651][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:40:24,367][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:40:25,083][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:40:25,798][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:40:26,515][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:40:27,229][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:40:27,947][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:40:28,661][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:40:29,378][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:40:30,094][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:40:30,811][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:40:31,526][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:40:32,243][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:40:32,958][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:40:33,676][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:40:34,392][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:40:35,108][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:40:35,825][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:40:36,541][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:40:37,260][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:40:37,976][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:40:38,693][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:40:39,410][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:40:40,127][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:40:40,843][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:40:41,561][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:40:42,278][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:40:42,995][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:40:43,711][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:40:44,427][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:40:45,144][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:40:45,861][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:40:46,578][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:40:47,294][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:40:48,011][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:40:48,951][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:40:49,670][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:40:50,387][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:40:51,104][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:40:51,820][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:40:52,538][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:40:53,255][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:40:53,974][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:40:54,690][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:40:55,408][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:40:56,126][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:40:56,843][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:40:57,561][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:40:58,279][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:40:58,997][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:40:59,714][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:41:00,433][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:41:01,161][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 20:41:02,310][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:41:02,315][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:41:02,317][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:41:03,661][__main__][INFO] - Iteration 391 took 55s (9.69% Gen, 87.90% Train). Generation: 5s, Training: 49s. Estimated remaining time: 9h 16m 27s. Estimated total time: 15h 32m 56s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 17s, 500 more iterations: 7h 46m 28s. [2026-03-25 20:41:03,665][__main__][INFO] - Starting iteration 391. [2026-03-25 20:41:03,669][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:41:03,669][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:41:09,155][__main__][INFO] - Number of regex retries in iteration 391: 0 [2026-03-25 20:41:09,156][__main__][INFO] - agents played in iteration 391 are Bob, Alice [2026-03-25 20:41:09,659][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:41:09,726][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:41:09,727][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:41:09,728][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:41:10,411][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:41:11,058][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:41:11,776][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:41:12,492][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:41:13,209][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:41:13,925][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:41:14,641][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:41:15,357][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:41:16,074][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:41:16,789][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:41:17,504][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:41:18,220][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:41:18,936][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:41:19,652][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:41:20,370][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:41:21,084][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:41:21,800][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:41:22,517][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:41:23,232][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:41:23,949][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:41:24,666][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:41:25,381][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:41:26,098][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:41:26,813][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:41:27,530][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:41:28,246][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:41:28,962][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:41:29,679][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:41:30,396][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:41:31,113][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:41:31,830][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:41:32,546][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:41:33,263][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:41:33,979][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:41:34,698][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:41:35,414][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:41:36,133][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:41:36,847][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:41:37,567][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:41:38,284][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:41:39,003][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:41:39,721][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:41:40,438][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:41:41,158][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:41:41,875][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:41:42,594][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:41:43,313][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:41:44,033][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:41:44,988][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:41:45,706][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:41:46,424][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:41:47,143][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:41:47,861][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:41:48,581][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:41:49,299][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:41:50,020][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:41:50,738][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:41:51,457][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:41:52,178][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:41:52,896][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:41:53,616][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:41:54,335][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:41:55,054][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:41:55,772][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:41:56,493][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:41:57,244][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 20:41:58,245][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:41:58,247][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:41:58,250][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:41:59,615][__main__][INFO] - Iteration 392 took 55s (9.81% Gen, 87.75% Train). Generation: 5s, Training: 49s. Estimated remaining time: 9h 15m 2s. Estimated total time: 15h 32m 27s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 14s, 500 more iterations: 7h 46m 13s. [2026-03-25 20:41:59,617][__main__][INFO] - Starting iteration 392. [2026-03-25 20:41:59,621][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:41:59,622][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:42:05,514][__main__][INFO] - Number of regex retries in iteration 392: 0 [2026-03-25 20:42:05,516][__main__][INFO] - agents played in iteration 392 are Bob, Alice [2026-03-25 20:42:06,026][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:42:06,092][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:42:06,093][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:42:06,093][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:42:06,808][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:42:07,455][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:42:08,175][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:42:08,892][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:42:09,610][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:42:10,329][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:42:11,047][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:42:11,765][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:42:12,482][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:42:13,203][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:42:13,922][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:42:14,641][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:42:15,360][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:42:16,078][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:42:16,796][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:42:17,516][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:42:18,232][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:42:18,949][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:42:19,667][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:42:20,386][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:42:21,107][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:42:21,824][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:42:22,543][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:42:23,262][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:42:23,979][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:42:24,696][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:42:25,414][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:42:26,132][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:42:26,851][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:42:27,569][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:42:28,290][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:42:29,007][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:42:29,725][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:42:30,444][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:42:31,162][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:42:31,882][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:42:32,599][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:42:33,318][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:42:34,038][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:42:34,754][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:42:35,475][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:42:36,193][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:42:36,912][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:42:37,631][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:42:38,348][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:42:39,068][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:42:39,787][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:42:40,506][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:42:41,520][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:42:42,241][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:42:42,958][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:42:43,676][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:42:44,398][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:42:45,117][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:42:45,837][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:42:46,557][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:42:47,277][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:42:47,997][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:42:48,717][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:42:49,437][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:42:50,159][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:42:50,878][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:42:51,599][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:42:52,318][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:42:53,038][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:42:53,816][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 20:42:54,765][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:42:54,767][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:42:54,768][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:42:56,158][__main__][INFO] - Iteration 393 took 56s (10.42% Gen, 87.11% Train). Generation: 5s, Training: 49s. Estimated remaining time: 9h 23m 57s. Estimated total time: 15h 42m 19s. Time estimates for 10 more iterations: 9m 25s, 100 more iterations: 1h 34m 13s, 500 more iterations: 7h 51m 9s. [2026-03-25 20:42:56,161][__main__][INFO] - Starting iteration 393. [2026-03-25 20:42:56,165][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:42:56,165][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:43:01,768][__main__][INFO] - Number of regex retries in iteration 393: 0 [2026-03-25 20:43:01,769][__main__][INFO] - agents played in iteration 393 are Bob, Alice [2026-03-25 20:43:02,276][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:43:02,341][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:43:02,342][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:43:02,343][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:43:03,063][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:43:03,713][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:43:04,431][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:43:05,150][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:43:05,868][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:43:06,585][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:43:07,306][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:43:08,022][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:43:08,742][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:43:09,461][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:43:10,178][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:43:10,897][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:43:20,087][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:43:20,801][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:43:21,517][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:43:22,231][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:43:22,945][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:43:23,662][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:43:24,377][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:43:25,093][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:43:25,810][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:43:26,526][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:43:27,241][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:43:27,957][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:43:28,673][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:43:29,389][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:43:30,107][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:43:30,824][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:43:31,541][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:43:32,258][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:43:32,976][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:43:33,694][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:43:34,413][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:43:35,131][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:43:35,848][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:43:36,568][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:43:37,287][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:43:38,007][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:43:38,724][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:43:39,440][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:43:40,159][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:43:40,874][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:43:41,594][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:43:47,427][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:43:48,300][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:43:49,016][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:43:49,732][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:43:50,447][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:43:51,408][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:43:52,124][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:43:52,841][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:43:53,557][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:43:54,275][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:43:54,988][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:43:55,706][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:43:56,421][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:43:57,138][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:43:57,855][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:43:58,573][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:43:59,289][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:44:00,004][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:44:00,723][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:44:01,440][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:44:02,158][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:44:02,875][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:44:03,623][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:01:00 [2026-03-25 20:44:04,555][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:44:04,558][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:44:04,560][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:44:06,652][__main__][INFO] - Iteration 394 took 1m 10s (7.95% Gen, 89.08% Train). Generation: 5s, Training: 1m 2s. Estimated remaining time: 13h 15m 16s. Estimated total time: 19h 34m 48s. Time estimates for 10 more iterations: 11m 44s, 100 more iterations: 1h 57m 28s, 500 more iterations: 9h 47m 24s. [2026-03-25 20:44:06,654][__main__][INFO] - Starting iteration 394. [2026-03-25 20:44:06,658][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:44:06,658][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:44:12,567][__main__][INFO] - Number of regex retries in iteration 394: 0 [2026-03-25 20:44:12,568][__main__][INFO] - agents played in iteration 394 are Bob, Alice [2026-03-25 20:44:13,076][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:44:13,142][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:44:13,143][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:44:13,144][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:44:13,855][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:44:14,499][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:44:15,215][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:44:15,928][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:44:16,644][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:44:17,356][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:44:18,069][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:44:18,784][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:44:19,498][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:44:20,213][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:44:20,927][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:44:21,643][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:44:22,357][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:44:23,070][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:44:23,786][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:44:24,500][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:44:25,217][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:44:25,931][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:44:26,647][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:44:27,363][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:44:28,079][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:44:28,793][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:44:29,510][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:44:30,225][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:44:30,940][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:44:31,657][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:44:32,372][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:44:33,089][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:44:33,803][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:44:34,520][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:44:35,236][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:44:35,952][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:44:36,669][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:44:37,385][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:44:38,100][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:44:38,821][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:44:39,544][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:44:40,258][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:44:40,978][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:44:41,696][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:44:42,414][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:44:43,135][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:44:43,851][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:44:44,568][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:44:45,284][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:44:46,001][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:44:46,717][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:44:47,435][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:44:48,386][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:44:49,105][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:44:49,822][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:44:50,541][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:44:51,258][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:44:51,974][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:44:52,692][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:44:53,409][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:44:54,127][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:44:54,845][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:44:55,564][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:44:56,281][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:44:57,001][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:44:57,719][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:44:58,436][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:44:59,154][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:44:59,871][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:45:00,611][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 20:45:01,568][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:45:01,570][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:45:01,572][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:45:02,924][__main__][INFO] - Iteration 395 took 56s (10.50% Gen, 87.09% Train). Generation: 5s, Training: 49s. Estimated remaining time: 9h 17m 19s. Estimated total time: 15h 37m 47s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 46s, 500 more iterations: 7h 48m 53s. [2026-03-25 20:45:02,926][__main__][INFO] - Starting iteration 395. [2026-03-25 20:45:02,929][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:45:02,930][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:45:08,456][__main__][INFO] - Number of regex retries in iteration 395: 0 [2026-03-25 20:45:08,457][__main__][INFO] - agents played in iteration 395 are Bob, Alice [2026-03-25 20:45:08,980][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:45:09,047][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:45:09,048][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:45:09,049][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:45:09,762][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:45:10,411][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:45:11,128][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:45:11,844][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:45:12,559][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:45:13,274][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:45:13,991][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:45:14,708][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:45:15,424][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:45:16,143][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:45:16,858][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:45:17,576][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:45:18,292][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:45:19,009][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:45:19,724][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:45:20,442][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:45:21,157][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:45:21,875][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:45:22,590][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:45:23,306][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:45:24,023][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:45:24,739][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:45:25,455][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:45:26,173][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:45:26,889][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:45:27,605][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:45:28,321][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:45:29,038][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:45:29,755][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:45:30,470][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:45:31,189][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:45:31,906][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:45:32,623][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:45:33,339][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:45:34,058][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:45:34,774][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:45:35,493][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:45:36,209][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:45:36,926][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:45:37,645][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:45:38,362][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:45:39,081][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:45:39,799][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:45:40,516][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:45:41,233][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:45:41,950][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:45:42,667][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:45:43,384][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:45:44,390][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:45:45,109][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:45:45,826][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:45:46,543][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:45:47,261][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:45:47,979][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:45:48,697][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:45:49,413][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:45:50,132][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:45:50,850][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:45:51,568][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:45:52,286][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:45:53,004][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:45:53,722][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:45:54,441][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:45:55,158][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:45:55,877][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:45:56,649][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 20:45:57,714][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:45:57,719][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:45:57,720][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:45:59,082][__main__][INFO] - Iteration 396 took 56s (9.84% Gen, 87.73% Train). Generation: 5s, Training: 49s. Estimated remaining time: 9h 14m 29s. Estimated total time: 15h 35m 53s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 35s, 500 more iterations: 7h 47m 56s. [2026-03-25 20:45:59,086][__main__][INFO] - Starting iteration 396. [2026-03-25 20:45:59,093][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:45:59,094][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:46:04,581][__main__][INFO] - Number of regex retries in iteration 396: 0 [2026-03-25 20:46:04,583][__main__][INFO] - agents played in iteration 396 are Bob, Alice [2026-03-25 20:46:05,195][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:46:05,261][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:46:05,262][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:46:05,262][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:46:05,965][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:46:06,610][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:46:07,330][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:46:08,047][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:46:08,765][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:46:09,482][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:46:10,200][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:46:10,915][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:46:11,634][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:46:12,348][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:46:13,066][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:46:13,782][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:46:14,497][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:46:15,212][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:46:15,928][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:46:16,645][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:46:17,362][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:46:18,078][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:46:18,795][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:46:19,510][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:46:20,228][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:46:20,946][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:46:21,661][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:46:22,378][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:46:23,095][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:46:23,812][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:46:24,528][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:46:25,247][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:46:25,962][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:46:26,679][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:46:27,396][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:46:28,114][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:46:28,830][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:46:29,549][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:46:30,266][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:46:30,985][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:46:31,704][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:46:32,421][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:46:33,137][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:46:33,855][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:46:34,572][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:46:35,290][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:46:36,006][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:46:36,725][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:46:37,444][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:46:38,161][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:46:38,881][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:46:39,600][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:46:40,552][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:46:41,270][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:46:41,989][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:46:42,707][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:46:43,426][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:46:44,144][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:46:44,861][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:46:45,581][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:46:46,297][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:46:47,016][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:46:47,734][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:46:48,451][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:46:49,170][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:46:49,887][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:46:50,605][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:46:51,324][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:46:52,041][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:46:52,769][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 20:46:53,715][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:46:53,718][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:46:53,719][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:46:55,079][__main__][INFO] - Iteration 397 took 55s (9.80% Gen, 87.76% Train). Generation: 5s, Training: 49s. Estimated remaining time: 9h 10m 49s. Estimated total time: 15h 33m 9s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 18s, 500 more iterations: 7h 46m 34s. [2026-03-25 20:46:55,083][__main__][INFO] - Starting iteration 397. [2026-03-25 20:46:55,087][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2026-03-25 20:46:55,088][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:47:00,599][__main__][INFO] - Number of regex retries in iteration 397: 0 [2026-03-25 20:47:00,600][__main__][INFO] - agents played in iteration 397 are Bob, Alice [2026-03-25 20:47:01,117][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:47:01,182][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:47:01,183][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:47:01,183][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:47:01,873][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:47:02,520][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:47:03,638][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:47:04,364][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:47:05,081][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:47:05,796][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:47:06,512][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:47:07,229][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:47:07,946][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:47:08,662][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:47:09,379][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:47:10,095][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:47:10,813][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:47:12,895][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:47:13,764][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:47:14,481][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:47:15,198][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:47:15,914][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:47:16,630][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:47:17,347][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:47:18,063][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:47:18,780][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:47:19,495][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:47:20,213][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:47:20,930][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:47:21,646][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:47:22,363][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:47:23,079][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:47:23,796][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:47:24,512][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:47:25,229][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:47:25,946][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:47:26,663][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:47:27,380][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:47:28,096][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:47:28,812][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:47:29,529][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:47:30,245][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:47:30,961][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:47:31,678][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:47:32,395][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:47:33,110][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:47:33,829][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:47:34,546][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:47:35,264][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:47:35,980][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:47:36,698][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:47:37,415][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:47:38,361][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:47:39,080][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:47:39,798][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:47:40,515][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:47:41,234][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:47:41,950][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:47:42,669][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:47:43,386][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:47:44,103][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:47:44,823][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:47:45,539][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:47:46,258][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:47:46,976][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:47:47,693][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:47:48,411][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:47:49,129][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:47:49,848][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:47:50,573][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:48
[2026-03-25 20:47:51,565][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt
[2026-03-25 20:47:51,567][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt
[2026-03-25 20:47:51,568][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl
[2026-03-25 20:47:52,918][__main__][INFO] - Iteration 398 took 57s (9.53% Gen, 88.13% Train). Generation: 5s, Training: 50s. Estimated remaining time: 9h 40m 34s. Estimated total time: 16h 3m 52s. Time estimates for 10 more iterations: 9m 38s, 100 more iterations: 1h 36m 23s, 500 more iterations: 8h 1m 56s.
[2026-03-25 20:47:52,920][__main__][INFO] - Starting iteration 398.
[2026-03-25 20:47:52,925][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1.
[2026-03-25 20:47:52,926][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:47:58,947][__main__][INFO] - Number of regex retries in iteration 398: 0 [2026-03-25 20:47:58,948][__main__][INFO] - agents played in iteration 398 are Bob, Alice [2026-03-25 20:47:59,447][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:47:59,514][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:47:59,515][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:47:59,516][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:48:00,203][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:48:00,850][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:48:01,568][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:48:02,284][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:48:03,001][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:48:03,717][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:48:04,435][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:48:05,150][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:48:05,869][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:48:06,584][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:48:07,305][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:48:08,021][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:48:08,739][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:48:09,458][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:48:10,174][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:48:10,891][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:48:11,607][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:48:12,324][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:48:13,040][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:48:13,756][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:48:14,473][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:48:15,189][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:48:15,904][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:48:16,621][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:48:17,338][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:48:18,053][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:48:18,772][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:48:19,488][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:48:20,205][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:48:20,921][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:48:21,639][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:48:22,356][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:48:23,073][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:48:23,790][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:48:24,508][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:48:25,225][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:48:25,943][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:48:26,659][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:48:27,380][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:48:28,097][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:48:28,815][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:48:29,533][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:48:30,251][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:48:30,969][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:48:31,687][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:48:32,404][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:48:33,124][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:48:33,839][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:48:34,844][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:48:35,564][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:48:36,282][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:48:37,000][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:48:37,719][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:48:38,437][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:48:39,157][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:48:39,876][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:48:40,594][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:48:41,313][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:48:42,031][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:48:42,751][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:48:43,473][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:48:44,192][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:48:44,911][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:48:45,631][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:48:46,349][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:48:47,128][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46
[2026-03-25 20:48:48,091][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt
[2026-03-25 20:48:48,094][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt
[2026-03-25 20:48:48,095][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl
[2026-03-25 20:48:49,475][__main__][INFO] - Iteration 399 took 56s (10.65% Gen, 86.90% Train). Generation: 6s, Training: 49s. Estimated remaining time: 9h 18m 17s. Estimated total time: 15h 42m 32s. Time estimates for 10 more iterations: 9m 25s, 100 more iterations: 1h 34m 15s, 500 more iterations: 7h 51m 16s.
[2026-03-25 20:48:49,477][__main__][INFO] - Starting iteration 399.
[2026-03-25 20:48:49,481][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1.
[2026-03-25 20:48:49,482][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:48:54,969][__main__][INFO] - Number of regex retries in iteration 399: 0 [2026-03-25 20:48:54,970][__main__][INFO] - agents played in iteration 399 are Bob, Alice [2026-03-25 20:48:55,466][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:48:55,532][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:48:55,533][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:48:55,534][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:48:56,229][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:48:56,876][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:48:57,595][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:48:58,313][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:48:59,034][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:48:59,751][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:49:00,468][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:49:01,186][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:49:01,904][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:49:02,623][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:49:03,341][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:49:04,058][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:49:04,775][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:49:05,494][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:49:06,214][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:49:06,930][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:49:07,648][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:49:08,366][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:49:09,085][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:49:09,802][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:49:10,519][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:49:11,237][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:49:11,955][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:49:12,674][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:49:13,390][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:49:14,108][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:49:14,826][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:49:15,544][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:49:16,261][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:49:16,978][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:49:17,698][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:49:18,417][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:49:19,136][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:49:19,855][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:49:20,573][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:49:21,292][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:49:22,009][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:49:22,727][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:49:23,444][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:49:24,162][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:49:24,881][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:49:25,597][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:49:26,316][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:49:27,032][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:49:27,751][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:49:28,467][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:49:29,186][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:49:29,903][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:49:30,856][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:49:31,574][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:49:32,292][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:49:33,010][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:49:33,728][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:49:34,448][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:49:35,164][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:49:35,882][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:49:36,600][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:49:37,318][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:49:38,035][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:49:38,755][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:49:39,474][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:49:40,191][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:49:40,911][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:49:41,629][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:49:42,349][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:49:43,073][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46
[2026-03-25 20:49:44,014][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt
[2026-03-25 20:49:44,017][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt
[2026-03-25 20:49:44,018][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl
[2026-03-25 20:49:45,402][__main__][INFO] - Iteration 400 took 55s (9.81% Gen, 87.71% Train). Generation: 5s, Training: 49s. Estimated remaining time: 9h 6m 51s. Estimated total time: 15h 32m 2s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 12s, 500 more iterations: 7h 46m 1s.
[2026-03-25 20:49:45,404][__main__][INFO] - Starting iteration 400.
[2026-03-25 20:49:45,408][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1.
[2026-03-25 20:49:45,409][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:49:50,758][__main__][INFO] - Number of regex retries in iteration 400: 0 [2026-03-25 20:49:50,759][__main__][INFO] - agents played in iteration 400 are Bob, Alice [2026-03-25 20:49:51,251][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:49:51,316][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:49:51,317][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:49:51,318][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:49:52,004][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:49:52,650][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:49:53,371][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:49:54,089][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:49:54,806][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:49:55,525][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:49:56,239][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:49:56,958][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:49:57,673][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:49:58,390][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:49:59,106][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:49:59,822][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:50:00,537][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:50:01,255][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:50:01,970][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:50:02,688][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:50:07,082][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:50:09,567][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:50:10,281][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:50:10,996][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:50:11,712][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:50:12,429][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:50:13,143][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:50:13,859][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:50:14,575][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:50:15,291][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:50:16,008][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:50:16,723][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:50:17,441][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:50:18,157][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:50:18,872][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:50:19,588][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:50:20,306][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:50:21,022][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:50:21,739][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:50:22,457][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:50:23,173][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:50:23,890][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:50:24,606][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:50:25,322][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:50:26,038][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:50:26,754][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:50:27,471][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:50:28,186][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:50:28,903][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:50:29,618][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:50:30,337][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:50:31,052][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:50:31,998][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:50:32,716][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:50:33,431][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:50:34,148][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:50:34,865][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:50:35,584][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:50:36,298][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:50:37,016][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:50:37,732][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:50:38,450][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:50:39,169][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:50:39,888][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:50:40,604][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:50:41,321][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:50:42,037][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:50:42,756][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:50:43,473][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:50:44,205][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:52
[2026-03-25 20:50:45,317][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt
[2026-03-25 20:50:45,321][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt
[2026-03-25 20:50:45,326][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl
[2026-03-25 20:50:48,064][__main__][INFO] - Iteration 401 took 1m 2s (8.54% Gen, 87.09% Train). Generation: 5s, Training: 54s. Estimated remaining time: 10h 58m 4s. Estimated total time: 17h 24m 17s. Time estimates for 10 more iterations: 10m 26s, 100 more iterations: 1h 44m 25s, 500 more iterations: 8h 42m 8s.
[2026-03-25 20:50:48,067][__main__][INFO] - Starting iteration 401.
[2026-03-25 20:50:48,073][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1.
[2026-03-25 20:50:48,074][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:50:54,069][__main__][INFO] - Number of regex retries in iteration 401: 0 [2026-03-25 20:50:54,070][__main__][INFO] - agents played in iteration 401 are Bob, Alice [2026-03-25 20:50:54,569][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:50:54,634][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:50:54,635][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:50:54,636][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:50:55,339][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:50:55,986][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:50:56,704][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:50:57,418][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:50:58,134][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:50:58,849][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:50:59,567][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:51:00,282][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:51:00,998][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:51:01,713][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:51:02,429][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:51:03,145][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:51:03,861][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:51:04,577][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:51:05,293][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:51:06,007][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:51:06,724][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:51:07,441][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:51:08,157][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:51:08,876][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:51:09,593][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:51:10,308][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:51:11,027][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:51:11,744][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:51:12,464][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:51:13,181][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:51:13,898][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:51:14,615][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:51:15,332][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:51:16,047][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:51:16,763][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:51:17,481][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:51:18,199][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:51:18,917][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:51:19,633][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:51:20,351][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:51:21,068][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:51:21,787][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:51:22,504][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:51:23,222][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:51:23,940][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:51:24,656][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:51:25,375][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:51:26,091][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:51:26,808][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:51:27,525][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:51:28,242][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:51:28,959][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:51:29,963][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:51:30,682][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:51:31,397][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:51:32,114][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:51:32,831][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:51:33,548][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:51:34,267][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:51:34,983][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:51:35,701][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:51:36,417][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:51:37,135][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:51:37,852][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:51:38,571][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:51:39,287][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:51:40,005][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:51:40,723][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:51:41,439][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:51:42,186][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 20:51:43,152][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:51:43,155][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:51:43,156][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:51:49,401][__main__][INFO] - Iteration 402 took 1m 1s (9.77% Gen, 80.04% Train). Generation: 5s, Training: 49s. Estimated remaining time: 10h 34m 55s. Estimated total time: 17h 2m 10s. Time estimates for 10 more iterations: 10m 13s, 100 more iterations: 1h 42m 13s, 500 more iterations: 8h 31m 5s. [2026-03-25 20:51:49,403][__main__][INFO] - Starting iteration 402. [2026-03-25 20:51:49,407][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:51:49,408][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:51:50,306][mllm.models.large_language_model_local][WARNING] - Response >B did not match regex: (|), retry 1/1 [2026-03-25 20:51:54,811][__main__][INFO] - Number of regex retries in iteration 402: 1 [2026-03-25 20:51:54,812][__main__][INFO] - agents played in iteration 402 are Bob, Alice [2026-03-25 20:51:55,348][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:51:55,413][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:51:55,414][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:51:55,415][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2026-03-25 20:51:56,097][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:51:56,741][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:51:57,460][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:51:58,174][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:51:58,888][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:51:59,604][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:52:00,318][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:52:01,034][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:52:01,748][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:52:02,462][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:52:03,178][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 20:52:03,892][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:52:04,610][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:52:05,324][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:52:06,040][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:52:06,754][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:52:07,470][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:52:08,185][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:52:08,904][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:52:09,618][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:52:10,336][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 256 [2026-03-25 20:52:11,049][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:52:11,768][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:52:12,483][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:52:13,200][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:52:13,916][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:52:14,632][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:52:15,348][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:52:16,067][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:52:16,783][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:52:17,501][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:52:18,217][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 20:52:18,932][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:52:19,650][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:52:20,366][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:52:21,084][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:52:21,800][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:52:22,518][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:52:23,234][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:52:23,951][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:52:24,666][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 
20:52:25,384][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:52:26,099][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:52:26,817][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:52:27,532][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:52:28,250][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:52:28,966][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:52:29,683][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:52:30,623][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:52:31,341][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:52:32,057][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:52:32,773][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 20:52:33,491][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:52:34,206][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:52:34,923][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:52:35,639][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:52:36,356][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:52:37,072][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:52:37,789][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:52:38,506][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:52:39,224][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:52:39,943][mllm.training.trainer_common][INFO] - 
Processing mini-batch 244 of 256 [2026-03-25 20:52:40,659][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:52:41,377][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:52:42,094][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 20:52:42,827][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 20:52:44,006][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:52:44,010][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:52:44,012][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:52:45,409][__main__][INFO] - Iteration 403 took 56s (9.65% Gen, 87.85% Train). Generation: 5s, Training: 49s. Estimated remaining time: 9h 5m 12s. Estimated total time: 15h 33m 23s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 20s, 500 more iterations: 7h 46m 41s. [2026-03-25 20:52:45,411][__main__][INFO] - Starting iteration 403. [2026-03-25 20:52:45,415][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:52:45,415][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:52:50,918][__main__][INFO] - Number of regex retries in iteration 403: 0 [2026-03-25 20:52:50,919][__main__][INFO] - agents played in iteration 403 are Bob, Alice [2026-03-25 20:52:51,510][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:52:51,577][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:52:51,578][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:52:51,579][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:52:52,275][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:52:52,920][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:52:53,642][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:52:54,358][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:52:55,075][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:52:55,792][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:52:56,508][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:52:57,224][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:52:57,941][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:52:58,660][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:52:59,375][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:53:00,093][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:53:00,810][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:53:01,528][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:53:02,245][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:53:02,962][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:53:03,677][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:53:04,395][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:53:05,111][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:53:05,828][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:53:06,545][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:53:07,261][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:53:07,979][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:53:08,693][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:53:09,411][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:53:10,128][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:53:10,844][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:53:11,561][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:53:12,278][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:53:12,993][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:53:13,710][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:53:14,426][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:53:15,143][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:53:15,860][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:53:16,576][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:53:17,294][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:53:18,009][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:53:18,727][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:53:19,443][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:53:20,160][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:53:20,876][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:53:21,594][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:53:22,310][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:53:23,028][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:53:23,744][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:53:24,460][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:53:25,177][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:53:25,894][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:53:26,852][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:53:27,569][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:53:28,287][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:53:29,003][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:53:29,721][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:53:30,439][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:53:31,156][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:53:31,873][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:53:32,591][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:53:33,307][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:53:34,023][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:53:34,741][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:53:35,458][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:53:36,176][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:53:36,893][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:53:37,611][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:53:38,328][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:53:39,056][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 20:53:40,278][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:53:40,282][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:53:40,284][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:53:41,727][__main__][INFO] - Iteration 404 took 56s (9.77% Gen, 87.66% Train). Generation: 5s, Training: 49s. Estimated remaining time: 9h 9m 26s. Estimated total time: 15h 38m 34s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 51s, 500 more iterations: 7h 49m 17s. [2026-03-25 20:53:41,731][__main__][INFO] - Starting iteration 404. [2026-03-25 20:53:41,735][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:53:41,736][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:53:47,541][__main__][INFO] - Number of regex retries in iteration 404: 0 [2026-03-25 20:53:47,542][__main__][INFO] - agents played in iteration 404 are Bob, Alice [2026-03-25 20:53:48,265][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:53:48,331][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:53:48,332][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:53:48,333][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:53:49,028][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:53:49,675][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:53:50,394][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:53:51,110][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:53:51,826][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:53:52,541][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:53:53,257][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:53:53,973][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:53:54,686][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:53:55,405][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:53:56,118][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:53:56,836][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:53:57,552][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:53:58,268][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:53:58,983][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:53:59,700][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:54:00,416][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:54:01,132][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:54:01,850][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:54:02,565][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:54:03,283][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:54:03,998][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:54:04,712][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:54:05,430][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:54:06,146][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:54:06,863][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:54:07,579][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:54:08,293][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:54:09,011][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:54:09,726][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:54:10,443][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:54:11,158][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:54:11,877][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:54:12,592][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:54:13,309][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:54:14,024][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:54:14,740][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:54:15,456][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:54:16,173][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:54:16,890][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:54:17,607][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:54:18,323][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:54:19,040][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:54:19,756][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:54:20,473][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:54:21,191][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:54:21,907][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:54:22,624][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:54:23,633][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:54:24,349][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:54:25,067][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:54:25,783][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:54:26,501][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:54:27,216][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:54:27,934][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:54:28,651][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:54:29,368][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:54:30,085][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:54:30,802][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:54:31,519][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:54:32,236][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:54:32,954][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:54:33,671][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:54:34,389][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:54:35,106][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:54:35,854][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 20:54:37,017][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:54:37,021][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:54:37,022][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:54:38,433][__main__][INFO] - Iteration 405 took 56s (10.24% Gen, 87.27% Train). Generation: 5s, Training: 49s. Estimated remaining time: 9h 14m 55s. Estimated total time: 15h 44m 59s. Time estimates for 10 more iterations: 9m 26s, 100 more iterations: 1h 34m 29s, 500 more iterations: 7h 52m 29s. [2026-03-25 20:54:38,436][__main__][INFO] - Starting iteration 405. [2026-03-25 20:54:38,441][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:54:38,442][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:54:43,903][__main__][INFO] - Number of regex retries in iteration 405: 0 [2026-03-25 20:54:43,904][__main__][INFO] - agents played in iteration 405 are Bob, Alice [2026-03-25 20:54:44,399][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:54:44,465][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:54:44,466][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:54:44,467][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:54:45,155][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:54:45,802][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:54:46,520][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:54:47,235][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:54:47,951][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:54:48,666][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:54:49,384][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:54:50,099][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:54:50,815][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:54:51,531][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:54:52,247][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:54:52,963][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:54:53,680][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:54:54,397][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:54:55,113][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:54:55,830][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:54:56,546][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:54:57,263][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:54:57,980][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:54:58,696][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:54:59,412][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:55:00,129][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:55:00,844][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:55:01,561][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:55:02,277][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:55:02,994][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:55:03,709][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:55:04,427][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:55:05,144][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:55:05,861][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:55:06,578][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:55:07,294][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:55:08,010][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:55:08,730][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:55:09,446][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:55:10,162][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:55:10,879][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:55:11,597][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:55:12,313][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:55:13,030][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:55:13,747][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:55:14,463][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:55:15,179][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:55:15,897][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:55:16,614][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:55:17,330][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:55:18,049][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:55:18,764][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:55:19,733][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:55:20,453][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:55:21,169][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:55:21,887][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:55:22,606][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:55:23,323][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:55:24,041][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:55:24,757][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:55:25,476][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:55:26,192][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:55:26,911][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:55:27,629][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:55:28,347][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:55:29,064][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:55:29,782][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:55:30,499][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:55:31,217][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:55:31,972][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 20:55:32,970][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:55:32,975][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:55:32,977][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:55:34,364][__main__][INFO] - Iteration 406 took 55s (9.77% Gen, 87.75% Train). Generation: 5s, Training: 49s. Estimated remaining time: 9h 1m 4s. Estimated total time: 15h 32m 4s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 12s, 500 more iterations: 7h 46m 2s. [2026-03-25 20:55:34,367][__main__][INFO] - Starting iteration 406. [2026-03-25 20:55:34,371][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:55:34,372][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:55:39,848][__main__][INFO] - Number of regex retries in iteration 406: 0 [2026-03-25 20:55:39,849][__main__][INFO] - agents played in iteration 406 are Bob, Alice [2026-03-25 20:55:40,342][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:55:40,408][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:55:40,409][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:55:40,409][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:55:41,085][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:55:41,731][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:55:42,448][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:55:43,164][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:55:43,880][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:55:44,598][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:55:45,314][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:55:46,029][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:55:46,745][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:55:47,461][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:55:48,177][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:55:48,892][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:55:49,609][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:55:50,325][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:55:51,041][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:55:51,756][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:55:52,472][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:55:53,188][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:55:53,905][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:55:54,620][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:55:55,336][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:55:56,053][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:55:56,769][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:55:57,487][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:55:58,202][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:55:58,919][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:55:59,635][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:56:00,353][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:56:01,068][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:56:01,785][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:56:02,501][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:56:03,218][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:56:03,935][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:56:04,650][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:56:05,367][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:56:06,084][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:56:06,801][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:56:07,518][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:56:08,234][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:56:08,953][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:56:09,670][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:56:10,388][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:56:11,104][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:56:11,823][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:56:12,540][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:56:13,258][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:56:13,974][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:56:14,692][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:56:15,641][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:56:16,359][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:56:17,077][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:56:17,795][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:56:18,513][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:56:19,230][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:56:19,948][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:56:20,665][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:56:21,384][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:56:22,101][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:56:22,821][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:56:23,538][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:56:24,256][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:56:24,973][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:56:25,693][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:56:26,408][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:56:27,126][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:56:27,852][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 20:56:29,090][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:56:29,093][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:56:29,095][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:56:30,623][__main__][INFO] - Iteration 407 took 56s (9.74% Gen, 87.54% Train). Generation: 5s, Training: 49s. Estimated remaining time: 9h 5m 38s. Estimated total time: 15h 37m 34s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 45s, 500 more iterations: 7h 48m 47s. [2026-03-25 20:56:30,626][__main__][INFO] - Starting iteration 407. [2026-03-25 20:56:30,631][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:56:30,631][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:56:36,092][__main__][INFO] - Number of regex retries in iteration 407: 0 [2026-03-25 20:56:36,093][__main__][INFO] - agents played in iteration 407 are Bob, Alice [2026-03-25 20:56:36,586][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:56:36,651][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:56:36,652][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:56:36,652][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:56:37,333][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:56:37,980][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:56:38,698][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:56:39,414][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:56:40,130][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:56:40,848][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:56:41,563][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:56:42,282][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:56:42,997][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:56:43,714][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:56:44,429][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:56:45,148][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:56:45,863][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:56:46,582][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:56:47,299][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:56:48,015][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:56:48,733][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:56:49,448][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:56:50,164][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:56:50,881][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:56:51,597][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:56:52,312][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:56:53,028][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:56:53,743][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:56:54,460][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:56:55,175][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:56:55,895][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:56:56,611][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:56:57,328][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:56:58,044][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:56:58,763][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:56:59,479][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:57:00,197][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:57:00,914][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:57:01,630][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:57:02,347][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:57:03,064][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:57:03,781][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:57:04,497][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:57:05,215][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:57:05,931][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:57:06,650][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:57:07,365][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:57:08,083][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:57:08,801][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:57:09,519][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:57:10,235][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:57:10,953][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:57:11,959][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:57:12,677][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:57:13,394][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:57:14,110][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:57:14,827][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:57:15,545][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:57:16,262][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:57:16,978][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:57:17,696][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:57:18,412][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:57:19,131][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:57:19,847][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:57:20,565][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:57:21,283][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:57:21,999][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:57:22,717][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:57:23,434][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:57:24,180][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 20:57:25,253][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:57:25,258][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:57:25,260][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:57:26,613][__main__][INFO] - Iteration 408 took 55s (9.76% Gen, 87.82% Train). Generation: 5s, Training: 49s. Estimated remaining time: 9h 0m 12s. Estimated total time: 15h 33m 4s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 18s, 500 more iterations: 7h 46m 32s. [2026-03-25 20:57:26,616][__main__][INFO] - Starting iteration 408. [2026-03-25 20:57:26,619][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:57:26,620][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:57:32,108][__main__][INFO] - Number of regex retries in iteration 408: 0 [2026-03-25 20:57:32,109][__main__][INFO] - agents played in iteration 408 are Bob, Alice [2026-03-25 20:57:32,606][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:57:32,671][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:57:32,672][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:57:32,672][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:57:33,364][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:57:34,010][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:57:34,728][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:57:35,443][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:57:36,159][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:57:36,876][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:57:37,592][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:57:38,309][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:57:39,024][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:57:39,742][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:57:40,458][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:57:41,176][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:57:41,893][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:57:42,610][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:57:43,330][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:57:44,047][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:57:44,763][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:57:45,478][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:57:46,195][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:57:46,911][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:57:47,633][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:57:48,347][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:57:49,067][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:57:49,782][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:57:50,500][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:57:52,457][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:57:54,059][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:57:56,289][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:57:57,006][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:57:57,720][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:57:58,437][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:57:59,151][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:57:59,869][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:58:00,583][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:58:01,301][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:58:02,016][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:58:02,733][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:58:03,448][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:58:04,165][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:58:04,881][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:58:05,598][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:58:06,313][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:58:07,028][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:58:07,743][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:58:08,460][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:58:09,176][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:58:09,893][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:58:10,611][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:58:11,581][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:58:12,298][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:58:13,013][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:58:13,730][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:58:14,444][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:58:15,161][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:58:15,877][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:58:16,593][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:58:17,309][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:58:18,026][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:58:18,741][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:58:19,459][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:58:20,174][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:58:20,892][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:58:21,608][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:58:22,326][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:58:23,042][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:58:23,778][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:50 [2026-03-25 20:58:24,739][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:58:24,743][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:58:24,745][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:58:26,067][__main__][INFO] - Iteration 409 took 59s (9.23% Gen, 88.54% Train). Generation: 5s, Training: 52s. Estimated remaining time: 9h 56m 57s. Estimated total time: 16h 30m 49s. Time estimates for 10 more iterations: 9m 54s, 100 more iterations: 1h 39m 4s, 500 more iterations: 8h 15m 24s. [2026-03-25 20:58:26,069][__main__][INFO] - Starting iteration 409. [2026-03-25 20:58:26,074][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:58:26,075][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:58:32,015][__main__][INFO] - Number of regex retries in iteration 409: 0 [2026-03-25 20:58:32,061][__main__][INFO] - agents played in iteration 409 are Bob, Alice [2026-03-25 20:58:32,647][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:58:32,713][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:58:32,715][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:58:32,715][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:58:33,401][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:58:34,047][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:58:34,764][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:58:35,478][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:58:36,192][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:58:36,908][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:58:37,624][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:58:38,340][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:58:39,056][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:58:39,773][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:58:40,489][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:58:41,205][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:58:41,920][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:58:42,637][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:58:43,352][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:58:44,067][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:58:44,782][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:58:45,498][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:58:46,214][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:58:46,929][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:58:47,646][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:58:48,361][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:58:49,077][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:58:49,794][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:58:50,512][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:58:51,226][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:58:51,943][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:58:52,657][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:58:53,374][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:58:54,089][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:58:54,806][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:58:55,521][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:58:56,239][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:58:56,953][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:58:57,671][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:58:58,387][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:58:59,104][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:58:59,818][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:59:00,535][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:59:01,250][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:59:01,968][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 20:59:02,685][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 20:59:03,401][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 20:59:04,118][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 20:59:04,833][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 20:59:05,550][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 20:59:06,265][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 20:59:06,983][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 20:59:07,934][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 20:59:08,651][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 20:59:09,370][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 20:59:10,086][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
20:59:10,802][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 20:59:11,519][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 20:59:12,235][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 20:59:12,953][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 20:59:13,669][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 20:59:14,386][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 20:59:15,103][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 20:59:15,821][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 20:59:16,537][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 20:59:17,253][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 20:59:17,969][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 20:59:18,687][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 20:59:19,404][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 20:59:20,127][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 20:59:21,221][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 20:59:21,225][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 20:59:21,227][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 20:59:22,813][__main__][INFO] - Iteration 410 took 56s (10.55% Gen, 86.65% Train). Generation: 5s, Training: 49s. Estimated remaining time: 9h 10m 52s. Estimated total time: 15h 45m 40s. Time estimates for 10 more iterations: 9m 27s, 100 more iterations: 1h 34m 34s, 500 more iterations: 7h 52m 50s. [2026-03-25 20:59:22,817][__main__][INFO] - Starting iteration 410. [2026-03-25 20:59:22,823][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 20:59:22,825][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 20:59:30,020][__main__][INFO] - Number of regex retries in iteration 410: 0 [2026-03-25 20:59:30,021][__main__][INFO] - agents played in iteration 410 are Bob, Alice [2026-03-25 20:59:30,560][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:59:30,624][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 20:59:30,624][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 20:59:30,625][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 20:59:31,306][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 20:59:31,951][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 20:59:32,667][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 20:59:33,383][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 20:59:34,098][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 20:59:34,813][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 20:59:35,526][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 20:59:36,242][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 20:59:36,958][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 20:59:37,674][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 20:59:38,391][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 20:59:39,109][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 20:59:39,825][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 20:59:40,540][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 20:59:41,255][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 20:59:41,973][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 20:59:42,688][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 20:59:43,405][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 20:59:44,121][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 20:59:44,837][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 20:59:45,554][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 20:59:46,270][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 20:59:46,987][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 20:59:47,702][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 20:59:48,418][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 20:59:49,135][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 20:59:49,850][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 20:59:50,567][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 20:59:51,282][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 20:59:51,998][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 20:59:52,715][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 20:59:53,429][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 20:59:54,147][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 20:59:54,866][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 20:59:55,580][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 20:59:56,296][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 20:59:57,012][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 20:59:57,727][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 20:59:58,444][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 20:59:59,160][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 20:59:59,877][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:00:00,593][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:00:01,309][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:00:02,027][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:00:02,743][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:00:03,461][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:00:04,177][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:00:04,895][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:00:05,905][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:00:06,623][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:00:07,339][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:00:08,056][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:00:08,773][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:00:09,492][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:00:10,208][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:00:10,926][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:00:11,643][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:00:12,360][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:00:13,077][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:00:13,794][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:00:14,510][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:00:15,227][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:00:15,946][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:00:16,663][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:00:17,381][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:00:18,119][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 21:00:19,093][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:00:19,095][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:00:19,097][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:00:20,466][__main__][INFO] - Iteration 411 took 57s (12.48% Gen, 85.13% Train). Generation: 7s, Training: 49s. Estimated remaining time: 9h 24m 59s. Estimated total time: 16h 0m 45s. Time estimates for 10 more iterations: 9m 36s, 100 more iterations: 1h 36m 4s, 500 more iterations: 8h 0m 22s. [2026-03-25 21:00:20,469][__main__][INFO] - Starting iteration 411. [2026-03-25 21:00:20,473][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 21:00:20,473][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:00:25,953][__main__][INFO] - Number of regex retries in iteration 411: 0 [2026-03-25 21:00:25,954][__main__][INFO] - agents played in iteration 411 are Bob, Alice [2026-03-25 21:00:26,455][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:00:26,519][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:00:26,520][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:00:26,521][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:00:27,208][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:00:27,858][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:00:28,574][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:00:29,291][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:00:30,006][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:00:30,722][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:00:31,438][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:00:32,159][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:00:32,873][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:00:33,590][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:00:34,304][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:00:35,022][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:00:35,740][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:00:36,457][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:00:37,173][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:00:37,890][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:00:38,606][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:00:39,324][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:00:40,040][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:00:40,756][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:00:41,473][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:00:42,191][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:00:42,907][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:00:43,628][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:00:44,346][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:00:45,064][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:00:45,782][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:00:46,499][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:00:47,218][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:00:47,935][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:00:48,653][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:00:49,370][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:00:50,087][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:00:50,805][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:00:51,521][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:00:52,239][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:00:52,956][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:00:53,674][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:00:54,391][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:00:55,112][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:00:55,833][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:00:56,553][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:00:57,271][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:00:57,990][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:00:58,706][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:00:59,424][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:01:00,140][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:01:00,858][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:01:01,831][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:01:02,550][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:01:03,266][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:01:03,984][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:01:04,703][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:01:05,422][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:01:06,140][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:01:06,856][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:01:07,574][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:01:08,294][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:01:09,012][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:01:09,731][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:01:10,449][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:01:11,167][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:01:11,886][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:01:12,604][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:01:13,323][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:01:14,046][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 21:01:15,277][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:01:15,281][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:01:15,283][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:01:16,756][__main__][INFO] - Iteration 412 took 56s (9.74% Gen, 87.64% Train). Generation: 5s, Training: 49s. Estimated remaining time: 9h 1m 22s. Estimated total time: 15h 38m 5s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 48s, 500 more iterations: 7h 49m 2s. [2026-03-25 21:01:16,760][__main__][INFO] - Starting iteration 412. [2026-03-25 21:01:16,767][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 21:01:16,769][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:01:22,229][__main__][INFO] - Number of regex retries in iteration 412: 0 [2026-03-25 21:01:22,230][__main__][INFO] - agents played in iteration 412 are Bob, Alice [2026-03-25 21:01:22,986][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:01:23,052][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:01:23,053][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:01:23,054][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:01:23,743][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:01:24,390][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:01:25,109][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:01:25,824][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:01:26,541][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:01:27,256][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:01:27,975][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:01:28,692][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:01:29,408][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:01:30,123][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:01:30,840][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:01:31,555][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:01:32,273][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:01:32,989][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:01:33,705][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:01:34,421][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:01:35,136][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:01:35,853][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:01:36,569][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:01:37,286][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:01:38,003][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:01:38,720][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:01:39,437][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:01:40,156][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:01:40,872][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:01:41,590][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:01:42,306][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:01:43,023][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:01:43,740][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:01:44,457][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:01:45,175][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:01:45,891][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:01:46,610][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:01:47,326][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:01:48,044][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:01:48,762][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:01:49,479][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:01:50,196][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:01:50,914][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:01:51,632][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:01:52,349][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:01:53,067][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:01:53,785][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:01:54,504][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:01:55,221][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:01:55,939][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:01:56,658][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:01:57,376][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:01:58,325][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:01:59,043][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:01:59,761][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:02:00,479][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:02:01,197][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:02:01,916][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:02:02,636][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:02:03,352][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:02:04,070][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:02:04,789][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:02:05,507][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:02:06,226][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:02:06,945][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:02:07,663][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:02:08,381][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:02:09,099][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:02:09,818][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:02:10,565][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 21:02:11,514][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:02:11,517][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:02:11,519][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:02:13,148][__main__][INFO] - Iteration 413 took 56s (9.68% Gen, 87.42% Train). Generation: 5s, Training: 49s. Estimated remaining time: 9h 2m 4s. Estimated total time: 15h 39m 43s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 58s, 500 more iterations: 7h 49m 51s. [2026-03-25 21:02:13,152][__main__][INFO] - Starting iteration 413. [2026-03-25 21:02:13,157][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 21:02:13,158][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:02:18,392][__main__][INFO] - Number of regex retries in iteration 413: 0 [2026-03-25 21:02:18,393][__main__][INFO] - agents played in iteration 413 are Bob, Alice [2026-03-25 21:02:18,893][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:02:18,959][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:02:18,959][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:02:18,960][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:02:19,645][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:02:20,293][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:02:21,013][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:02:21,730][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:02:22,448][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:02:23,164][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:02:23,880][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:02:24,596][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:02:25,315][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:02:26,031][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:02:26,748][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:02:27,466][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:02:28,183][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:02:28,900][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:02:29,616][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:02:30,334][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:02:31,049][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:02:31,767][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:02:32,484][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:02:33,202][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:02:33,920][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:02:34,640][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:02:35,357][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:02:36,074][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:02:36,792][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:02:37,511][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:02:38,229][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:02:38,946][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:02:39,665][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:02:40,383][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:02:41,100][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:02:41,818][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:02:42,535][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:02:43,253][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:02:43,969][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:02:44,688][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:02:45,406][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:02:46,123][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:02:46,842][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:02:47,559][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:02:48,278][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:02:48,996][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:02:49,714][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:02:50,431][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:02:51,149][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:02:51,868][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:02:52,586][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:02:53,305][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:02:54,310][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:02:55,029][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:02:55,748][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:02:56,466][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:02:57,184][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:02:57,903][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:02:58,620][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:02:59,340][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:03:00,057][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:03:00,777][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:03:01,495][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:03:02,214][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:03:02,933][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:03:03,651][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:03:04,371][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:03:05,089][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:03:05,810][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:03:06,559][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 21:03:07,470][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:03:07,472][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:03:07,474][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:03:08,803][__main__][INFO] - Iteration 414 took 55s (9.41% Gen, 88.20% Train). Generation: 5s, Training: 49s. Estimated remaining time: 8h 48m 53s. Estimated total time: 15h 27m 28s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 44s, 500 more iterations: 7h 43m 44s. [2026-03-25 21:03:08,806][__main__][INFO] - Starting iteration 414. [2026-03-25 21:03:08,810][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 21:03:08,811][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:03:14,250][__main__][INFO] - Number of regex retries in iteration 414: 0 [2026-03-25 21:03:14,252][__main__][INFO] - agents played in iteration 414 are Bob, Alice [2026-03-25 21:03:14,746][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:03:14,811][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:03:14,812][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:03:14,813][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:03:15,496][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:03:16,140][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:03:16,859][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:03:17,576][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:03:18,292][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:03:19,008][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:03:19,725][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:03:20,441][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:03:21,158][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:03:21,875][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:03:22,592][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:03:23,310][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:03:24,028][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:03:24,744][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:03:25,462][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:03:26,179][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:03:26,896][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:03:27,613][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:03:28,331][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:03:29,049][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:03:29,766][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:03:30,485][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:03:31,202][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:03:31,920][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:03:32,638][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:03:33,354][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:03:34,074][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:03:34,792][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:03:35,509][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:03:36,227][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:03:36,945][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:03:37,664][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:03:38,381][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:03:39,102][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:03:39,821][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:03:40,539][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:03:41,258][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:03:41,977][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:03:42,695][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:03:43,414][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:03:44,132][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:03:44,850][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:03:45,571][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:03:46,288][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:03:47,006][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:03:47,726][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:03:48,444][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:03:49,163][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:03:50,146][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:03:50,864][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:03:51,583][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:03:52,302][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:03:53,021][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:03:53,740][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:03:54,459][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:03:55,177][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:03:55,897][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:03:56,615][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:03:57,335][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:03:58,053][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:03:58,772][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:03:59,491][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:04:00,209][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:04:00,930][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:04:01,649][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:04:02,373][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 21:04:03,517][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:04:03,521][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:04:03,523][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:04:04,861][__main__][INFO] - Iteration 415 took 56s (9.71% Gen, 87.90% Train). Generation: 5s, Training: 49s. Estimated remaining time: 8h 54m 43s. Estimated total time: 15h 34m 13s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 25s, 500 more iterations: 7h 47m 6s. [2026-03-25 21:04:04,864][__main__][INFO] - Starting iteration 415. [2026-03-25 21:04:04,867][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 21:04:04,868][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:04:10,117][__main__][INFO] - Number of regex retries in iteration 415: 0 [2026-03-25 21:04:10,119][__main__][INFO] - agents played in iteration 415 are Bob, Alice [2026-03-25 21:04:10,636][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:04:10,700][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:04:10,701][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:04:10,702][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:04:11,387][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:04:12,034][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:04:12,752][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:04:13,469][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:04:14,185][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:04:14,903][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:04:15,621][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:04:16,338][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:04:17,055][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:04:17,774][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:04:18,491][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:04:19,208][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:04:19,925][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:04:20,641][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:04:21,359][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:04:22,075][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:04:22,794][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:04:23,510][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:04:24,227][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:04:24,944][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:04:25,662][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:04:26,379][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:04:27,097][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:04:27,815][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:04:28,533][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:04:29,250][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:04:29,968][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:04:30,686][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:04:31,404][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:04:32,121][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:04:32,841][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:04:33,558][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:04:34,276][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:04:34,995][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:04:35,712][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:04:36,431][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:04:37,151][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:04:37,867][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:04:38,586][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:04:39,305][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:04:40,024][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:04:40,743][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:04:41,460][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:04:42,179][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:04:42,898][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:04:43,616][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:04:44,336][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:04:45,053][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:04:46,000][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:04:46,720][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:04:47,438][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:04:48,156][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:04:48,876][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:04:49,595][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:04:50,314][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:04:51,033][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:04:51,752][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:04:52,471][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:04:53,190][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:04:53,909][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:04:54,628][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:04:55,348][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:04:56,067][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:04:56,786][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:04:57,506][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:04:58,235][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 21:04:59,746][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:04:59,751][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:04:59,753][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:05:01,239][__main__][INFO] - Iteration 416 took 56s (9.31% Gen, 88.05% Train). Generation: 5s, Training: 49s. Estimated remaining time: 8h 59m 6s. Estimated total time: 15h 39m 33s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 57s, 500 more iterations: 7h 49m 46s. [2026-03-25 21:05:01,242][__main__][INFO] - Starting iteration 416. [2026-03-25 21:05:01,246][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 21:05:01,247][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:05:06,448][__main__][INFO] - Number of regex retries in iteration 416: 0 [2026-03-25 21:05:06,449][__main__][INFO] - agents played in iteration 416 are Bob, Alice [2026-03-25 21:05:07,035][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:05:07,100][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:05:07,101][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:05:07,102][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:05:07,790][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:05:08,436][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:05:09,155][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:05:09,871][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:05:10,589][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:05:11,305][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:05:12,023][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:05:12,740][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:05:13,458][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:05:14,174][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:05:14,892][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:05:15,607][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:05:16,323][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:05:17,041][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:05:17,757][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:05:18,477][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:05:19,193][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:05:19,910][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:05:20,627][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:05:21,345][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:05:22,063][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:05:22,781][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:05:23,499][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:05:24,220][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:05:24,938][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:05:25,656][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:05:26,375][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:05:27,092][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:05:27,812][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:05:28,529][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:05:29,247][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:05:29,968][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:05:30,685][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:05:31,403][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:05:32,122][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:05:32,839][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:05:33,559][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:05:34,276][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:05:34,995][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:05:35,713][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:05:36,430][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:05:37,150][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:05:37,869][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:05:38,587][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:05:39,309][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:05:40,026][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:05:40,744][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:05:41,465][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:05:42,470][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:05:43,189][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:05:43,907][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:05:44,626][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:05:45,346][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:05:46,066][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:05:46,785][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:05:47,504][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:05:48,225][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:05:48,945][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:05:49,664][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:05:50,383][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:05:51,102][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:05:51,820][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:05:52,540][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:05:53,260][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:05:53,979][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:05:54,738][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 21:05:55,926][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:05:55,930][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:05:55,933][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:05:57,550][__main__][INFO] - Iteration 417 took 56s (9.24% Gen, 87.88% Train). Generation: 5s, Training: 49s. Estimated remaining time: 8h 57m 3s. Estimated total time: 15h 38m 26s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 50s, 500 more iterations: 7h 49m 13s. [2026-03-25 21:05:57,553][__main__][INFO] - Starting iteration 417. [2026-03-25 21:05:57,557][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 21:05:57,558][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:06:03,075][__main__][INFO] - Number of regex retries in iteration 417: 0 [2026-03-25 21:06:03,077][__main__][INFO] - agents played in iteration 417 are Bob, Alice [2026-03-25 21:06:03,596][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:06:03,661][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:06:03,662][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:06:03,662][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:06:04,345][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:06:05,000][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:06:05,717][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:06:06,436][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:06:07,151][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:06:07,868][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:06:08,585][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:06:09,303][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:06:10,020][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:06:10,738][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:06:11,456][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:06:12,173][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:06:12,890][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:06:13,608][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:06:14,328][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:06:15,046][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:06:15,762][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:06:16,480][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:06:17,198][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:06:17,915][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:06:18,632][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:06:19,350][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:06:20,068][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:06:20,785][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:06:21,502][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:06:22,220][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:06:22,937][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:06:23,657][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:06:24,374][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:06:25,093][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:06:25,810][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:06:26,528][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:06:27,246][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:06:27,965][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:06:28,683][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:06:29,401][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:06:30,121][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:06:30,839][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:06:31,558][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:06:32,277][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:06:32,996][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:06:33,714][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:06:34,433][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:06:35,151][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:06:35,872][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:06:36,590][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:06:37,309][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:06:38,028][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:06:39,012][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:06:39,733][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:06:40,450][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:06:41,169][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:06:41,888][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:06:42,608][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:06:43,327][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:06:44,047][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:06:44,766][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:06:45,485][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:06:46,205][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:06:46,924][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:06:47,645][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:06:48,363][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:06:49,082][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:06:49,802][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:06:50,521][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:06:51,250][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 21:06:52,398][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:06:52,402][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:06:52,404][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:06:53,736][__main__][INFO] - Iteration 418 took 56s (9.82% Gen, 87.80% Train). Generation: 5s, Training: 49s. Estimated remaining time: 8h 54m 0s. Estimated total time: 15h 36m 20s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 38s, 500 more iterations: 7h 48m 10s. [2026-03-25 21:06:53,739][__main__][INFO] - Starting iteration 418. [2026-03-25 21:06:53,743][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 21:06:53,744][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:06:58,849][__main__][INFO] - Number of regex retries in iteration 418: 0 [2026-03-25 21:06:58,850][__main__][INFO] - agents played in iteration 418 are Bob, Alice [2026-03-25 21:06:59,346][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:06:59,412][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:06:59,413][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:06:59,413][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:07:00,096][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:07:00,743][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:07:01,463][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:07:02,179][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:07:02,896][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:07:03,612][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:07:04,330][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:07:05,046][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:07:05,764][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:07:08,062][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:07:10,676][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:07:11,392][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:07:12,110][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:07:12,826][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:07:13,542][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:07:14,260][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:07:14,976][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:07:15,693][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:07:16,411][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:07:17,127][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:07:17,843][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:07:18,559][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:07:19,276][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:07:19,993][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:07:20,710][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:07:21,428][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:07:22,143][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:07:22,861][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:07:23,581][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:07:24,297][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:07:25,015][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:07:25,732][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:07:26,448][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:07:27,167][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:07:27,884][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:07:28,602][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:07:29,318][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:07:30,037][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:07:30,754][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:07:31,473][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:07:32,190][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:07:32,908][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:07:33,625][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:07:34,342][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:07:35,060][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:07:35,777][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:07:36,494][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:07:37,212][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:07:38,161][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:07:38,878][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:07:39,598][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:07:40,314][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:07:41,032][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:07:41,750][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:07:42,468][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:07:43,186][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:07:43,904][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:07:44,623][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:07:45,340][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:07:46,058][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:07:46,776][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:07:47,494][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:07:48,213][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:07:48,930][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:07:49,650][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:07:50,370][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:50 [2026-03-25 21:07:51,464][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:07:51,469][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:07:51,471][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:07:52,872][__main__][INFO] - Iteration 419 took 59s (8.63% Gen, 88.99% Train). Generation: 5s, Training: 52s. Estimated remaining time: 9h 42m 13s. Estimated total time: 16h 25m 31s. Time estimates for 10 more iterations: 9m 51s, 100 more iterations: 1h 38m 33s, 500 more iterations: 8h 12m 45s. [2026-03-25 21:07:52,875][__main__][INFO] - Starting iteration 419. [2026-03-25 21:07:52,878][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 21:07:52,879][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:07:58,474][__main__][INFO] - Number of regex retries in iteration 419: 0 [2026-03-25 21:07:58,475][__main__][INFO] - agents played in iteration 419 are Bob, Alice [2026-03-25 21:07:58,968][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:07:59,034][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:07:59,035][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:07:59,035][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:07:59,716][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:08:00,363][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:08:01,082][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:08:01,798][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:08:02,515][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:08:03,232][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:08:03,948][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:08:04,665][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:08:05,383][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:08:06,099][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:08:06,814][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:08:07,530][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:08:08,246][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:08:08,962][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:08:09,678][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:08:10,395][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:08:11,111][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:08:11,827][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:08:12,545][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:08:13,261][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:08:13,980][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:08:14,696][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:08:15,414][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:08:16,130][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:08:16,848][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:08:17,565][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:08:18,281][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:08:18,999][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:08:19,716][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:08:20,433][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:08:21,150][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:08:21,869][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:08:22,586][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:08:23,305][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:08:24,022][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:08:24,739][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:08:25,456][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:08:26,175][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:08:26,892][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:08:27,609][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:08:28,328][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:08:29,044][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:08:29,762][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:08:30,480][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:08:31,198][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:08:31,916][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:08:32,635][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:08:33,352][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:08:34,299][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:08:35,019][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:08:35,737][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:08:36,455][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:08:37,173][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:08:37,892][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:08:38,610][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:08:39,328][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:08:40,045][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:08:40,763][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:08:41,482][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:08:42,199][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:08:42,918][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:08:43,637][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:08:44,357][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:08:45,074][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:08:45,794][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:08:46,533][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 21:08:47,565][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:08:47,568][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:08:47,570][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:08:48,909][__main__][INFO] - Iteration 420 took 56s (9.99% Gen, 87.62% Train). Generation: 5s, Training: 49s. Estimated remaining time: 8h 49m 37s. Estimated total time: 15h 33m 52s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 23s, 500 more iterations: 7h 46m 56s. [2026-03-25 21:08:48,911][__main__][INFO] - Starting iteration 420. [2026-03-25 21:08:48,915][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 21:08:48,916][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:08:57,850][__main__][INFO] - Number of regex retries in iteration 420: 0 [2026-03-25 21:08:57,852][__main__][INFO] - agents played in iteration 420 are Bob, Alice [2026-03-25 21:08:58,571][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:08:58,637][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:08:58,638][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:08:58,639][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:08:59,321][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:08:59,968][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:09:00,685][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:09:01,399][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:09:02,115][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:09:02,830][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:09:03,545][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:09:04,262][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:09:04,977][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:09:05,695][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:09:06,409][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:09:07,126][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:09:07,841][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:09:08,559][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:09:09,276][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:09:09,992][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:09:10,708][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:09:11,423][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:09:12,140][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:09:12,855][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:09:13,571][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:09:14,288][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:09:15,004][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:09:15,722][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:09:16,436][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:09:17,154][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:09:17,871][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:09:18,587][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:09:19,302][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:09:20,019][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:09:20,736][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:09:21,453][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:09:22,169][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:09:22,885][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:09:23,602][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:09:24,319][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:09:25,035][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:09:25,751][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:09:26,470][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:09:27,187][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:09:27,906][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:09:28,623][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:09:29,341][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:09:30,057][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:09:30,776][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:09:31,493][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:09:32,210][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:09:32,928][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:09:33,987][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:09:34,706][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:09:35,423][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:09:36,142][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:09:36,859][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:09:37,577][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:09:38,295][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:09:39,012][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:09:39,730][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:09:40,448][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:09:41,167][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:09:41,884][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:09:42,604][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:09:43,321][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:09:44,040][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:09:44,757][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:09:45,476][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:09:46,221][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 21:09:47,395][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:09:47,400][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:09:47,402][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:09:48,705][__main__][INFO] - Iteration 421 took 59s (14.94% Gen, 82.87% Train). Generation: 8s, Training: 49s. Estimated remaining time: 9h 51m 16s. Estimated total time: 16h 36m 31s. Time estimates for 10 more iterations: 9m 57s, 100 more iterations: 1h 39m 39s, 500 more iterations: 8h 18m 15s. [2026-03-25 21:09:48,708][__main__][INFO] - Starting iteration 421. [2026-03-25 21:09:48,715][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 21:09:48,716][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:09:53,816][__main__][INFO] - Number of regex retries in iteration 421: 0 [2026-03-25 21:09:53,818][__main__][INFO] - agents played in iteration 421 are Bob, Alice [2026-03-25 21:09:54,312][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:09:54,377][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:09:54,378][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:09:54,379][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:09:55,071][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:09:55,717][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:09:56,435][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:09:57,153][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:09:57,868][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:09:58,586][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:09:59,304][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:10:00,021][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:10:00,737][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:10:01,454][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:10:02,169][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:10:02,886][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:10:03,602][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:10:04,318][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:10:05,034][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:10:05,750][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:10:06,468][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:10:07,183][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:10:07,901][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:10:08,616][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:10:09,335][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:10:10,051][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:10:10,770][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:10:11,487][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:10:12,205][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:10:12,921][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:10:13,638][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:10:14,355][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:10:15,072][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:10:15,791][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:10:16,509][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:10:17,227][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:10:17,944][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:10:18,660][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:10:19,379][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:10:20,096][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:10:20,815][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:10:21,533][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:10:22,251][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:10:22,970][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:10:23,687][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:10:24,407][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:10:25,124][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:10:25,842][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:10:26,560][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:10:27,277][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:10:27,997][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:10:28,714][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:10:29,666][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:10:30,384][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:10:31,105][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:10:31,825][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:10:32,542][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:10:33,261][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:10:33,979][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:10:34,697][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:10:35,417][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:10:36,135][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:10:36,854][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:10:37,573][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:10:38,290][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:10:39,009][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:10:39,728][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:10:40,449][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:10:41,166][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:10:41,894][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 21:10:42,926][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:10:42,929][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:10:42,931][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:10:44,345][__main__][INFO] - Iteration 422 took 55s (9.17% Gen, 88.28% Train). Generation: 5s, Training: 49s. Estimated remaining time: 8h 41m 2s. Estimated total time: 15h 27m 12s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 43s, 500 more iterations: 7h 43m 36s. [2026-03-25 21:10:44,347][__main__][INFO] - Starting iteration 422. [2026-03-25 21:10:44,351][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 21:10:44,352][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:10:49,367][__main__][INFO] - Number of regex retries in iteration 422: 0 [2026-03-25 21:10:49,369][__main__][INFO] - agents played in iteration 422 are Bob, Alice [2026-03-25 21:10:49,884][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:10:49,950][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:10:49,951][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:10:49,952][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:10:50,639][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:10:51,286][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:10:52,005][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:10:52,720][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:10:53,436][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:10:54,152][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:10:54,869][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:10:55,586][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:10:56,303][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:10:57,019][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:10:57,736][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:10:58,454][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:10:59,170][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:10:59,888][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:11:00,605][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:11:01,322][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:11:02,040][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:11:02,756][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:11:03,475][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:11:04,191][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:11:04,910][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:11:05,628][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:11:06,345][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:11:07,061][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:11:07,779][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:11:08,498][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:11:09,216][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:11:09,936][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:11:10,654][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:11:11,372][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:11:12,091][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:11:12,808][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:11:13,526][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:11:14,246][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:11:14,964][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:11:15,682][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:11:16,401][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:11:17,119][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:11:17,839][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:11:18,557][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:11:19,274][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:11:19,994][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:11:20,712][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:11:21,430][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:11:22,148][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:11:22,866][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:11:23,587][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:11:24,304][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:11:25,262][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:11:25,983][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:11:26,700][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:11:27,420][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:11:28,140][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:11:28,857][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:11:29,578][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:11:30,295][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:11:31,015][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:11:31,735][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:11:32,454][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:11:33,172][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:11:33,892][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:11:34,611][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:11:35,330][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:11:36,049][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:11:36,767][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:11:37,503][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 21:11:38,725][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:11:38,728][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:11:38,729][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:11:40,122][__main__][INFO] - Iteration 423 took 55s (9.00% Gen, 88.50% Train). Generation: 5s, Training: 49s. Estimated remaining time: 8h 42m 27s. Estimated total time: 15h 29m 33s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 57s, 500 more iterations: 7h 44m 46s. [2026-03-25 21:11:40,125][__main__][INFO] - Starting iteration 423. [2026-03-25 21:11:40,129][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 21:11:40,130][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:11:45,190][__main__][INFO] - Number of regex retries in iteration 423: 0 [2026-03-25 21:11:45,194][__main__][INFO] - agents played in iteration 423 are Bob, Alice [2026-03-25 21:11:45,801][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:11:45,866][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:11:45,867][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:11:45,868][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:11:46,555][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:11:47,204][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:11:47,923][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:11:48,640][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:11:49,356][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:11:50,075][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:11:50,790][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:11:51,509][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:11:52,225][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:11:52,943][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:11:53,664][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:11:54,383][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:11:55,100][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:11:55,819][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:11:56,537][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:11:57,254][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:11:57,973][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:11:58,691][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:11:59,411][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:12:00,130][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:12:00,850][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:12:01,567][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:12:02,287][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:12:03,004][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:12:03,723][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:12:04,441][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:12:05,160][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:12:05,878][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:12:06,596][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:12:07,316][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:12:08,034][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:12:08,754][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:12:09,473][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:12:10,191][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:12:10,911][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:12:11,629][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:12:12,348][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:12:13,068][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:12:13,785][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:12:14,505][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:12:15,225][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:12:15,942][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:12:16,662][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:12:17,382][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:12:18,100][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:12:18,821][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:12:19,540][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:12:20,260][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:12:21,270][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:12:21,990][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:12:22,709][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:12:23,427][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:12:24,147][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:12:24,866][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:12:25,585][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:12:26,305][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:12:27,027][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:12:27,747][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:12:28,465][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:12:29,185][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:12:29,905][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:12:30,624][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:12:31,343][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:12:32,064][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:12:32,784][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:12:33,534][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 21:12:34,568][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:12:34,571][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:12:34,573][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:12:36,032][__main__][INFO] - Iteration 424 took 55s (9.06% Gen, 88.33% Train). Generation: 5s, Training: 49s. Estimated remaining time: 8h 43m 43s. Estimated total time: 15h 31m 45s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 10s, 500 more iterations: 7h 45m 52s. [2026-03-25 21:12:36,039][__main__][INFO] - Starting iteration 424. [2026-03-25 21:12:36,044][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 21:12:36,045][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:12:41,140][__main__][INFO] - Number of regex retries in iteration 424: 0 [2026-03-25 21:12:41,141][__main__][INFO] - agents played in iteration 424 are Bob, Alice [2026-03-25 21:12:41,645][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:12:41,710][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:12:41,711][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:12:41,711][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:12:42,390][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:12:43,036][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:12:43,756][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:12:44,471][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:12:45,189][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:12:45,905][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:12:46,622][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:12:47,339][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:12:48,057][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:12:48,774][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:12:49,492][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:12:50,209][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:12:50,927][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:12:51,644][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:12:52,362][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:12:53,079][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:12:53,797][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:12:54,514][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:12:55,232][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:12:55,950][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:12:56,668][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:12:57,385][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:12:58,103][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:12:58,822][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:12:59,541][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:13:00,260][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:13:00,979][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:13:01,697][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:13:02,416][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:13:03,134][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:13:03,855][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:13:04,573][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:13:05,292][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:13:06,011][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:13:06,730][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:13:07,450][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:13:08,169][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:13:08,889][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:13:09,609][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:13:10,329][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:13:11,047][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:13:11,767][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:13:12,486][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:13:13,205][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:13:13,924][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:13:14,644][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:13:15,363][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:13:16,083][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:13:17,036][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:13:17,758][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:13:18,477][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:13:19,196][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:13:19,917][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:13:20,637][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:13:21,356][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:13:22,075][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:13:22,797][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:13:23,516][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:13:24,236][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:13:24,956][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:13:25,676][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:13:26,397][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:13:27,116][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:13:27,838][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:13:28,558][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:13:29,285][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 21:13:30,770][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:13:30,775][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:13:30,777][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:13:32,361][__main__][INFO] - Iteration 425 took 56s (9.05% Gen, 88.14% Train). Generation: 5s, Training: 49s. Estimated remaining time: 8h 49m 40s. Estimated total time: 15h 38m 38s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 51s, 500 more iterations: 7h 49m 19s. [2026-03-25 21:13:32,365][__main__][INFO] - Starting iteration 425. [2026-03-25 21:13:32,372][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 21:13:32,373][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:13:37,841][__main__][INFO] - Number of regex retries in iteration 425: 0 [2026-03-25 21:13:37,842][__main__][INFO] - agents played in iteration 425 are Bob, Alice [2026-03-25 21:13:38,335][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:13:38,400][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:13:38,401][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:13:38,402][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:13:39,077][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:13:39,724][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:13:40,444][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:13:41,161][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:13:41,878][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:13:42,594][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:13:43,312][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:13:44,029][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:13:44,746][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:13:45,466][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:13:46,182][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:13:46,901][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:13:47,620][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:13:48,338][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:13:49,058][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:13:49,776][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:13:50,494][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:13:51,213][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:13:51,931][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:13:52,650][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:13:53,369][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:13:54,088][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:13:54,807][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:13:55,525][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:13:56,244][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:13:56,964][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:13:57,682][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:13:58,401][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:13:59,121][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:13:59,839][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:14:00,559][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:14:01,278][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:14:01,997][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:14:02,718][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:14:03,438][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:14:04,156][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:14:04,877][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:14:05,598][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:14:06,319][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:14:07,041][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:14:07,760][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:14:08,480][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:14:09,201][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:14:09,923][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:14:10,642][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:14:11,362][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:14:12,084][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:14:12,802][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:14:13,754][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:14:14,474][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:14:15,193][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:14:15,913][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:14:16,634][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:14:17,354][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:14:18,075][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:14:18,795][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:14:19,515][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:14:20,235][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:14:20,957][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:14:21,676][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:14:22,396][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:14:23,118][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:14:23,837][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:14:24,558][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:14:25,280][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:14:26,024][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 21:14:27,232][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:14:27,236][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:14:27,238][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:14:28,636][__main__][INFO] - Iteration 426 took 56s (9.72% Gen, 87.79% Train). Generation: 5s, Training: 49s. Estimated remaining time: 8h 47m 53s. Estimated total time: 15h 37m 47s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 46s, 500 more iterations: 7h 48m 53s. [2026-03-25 21:14:28,638][__main__][INFO] - Starting iteration 426. [2026-03-25 21:14:28,642][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 21:14:28,643][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:14:34,823][__main__][INFO] - Number of regex retries in iteration 426: 0 [2026-03-25 21:14:34,824][__main__][INFO] - agents played in iteration 426 are Bob, Alice [2026-03-25 21:14:35,315][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:14:35,380][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:14:35,382][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:14:35,382][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:14:36,063][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:14:36,709][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:14:37,429][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:14:38,146][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:14:38,866][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:14:39,584][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:14:40,302][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:14:41,020][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:14:41,738][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:14:42,457][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:14:43,173][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:14:43,892][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:14:44,610][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:14:45,328][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:14:46,048][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:14:46,767][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:14:47,485][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:14:48,205][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:14:48,923][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:14:49,643][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:14:50,363][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:14:51,081][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:14:51,799][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:14:52,519][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:14:53,237][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:14:53,957][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:14:54,677][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:14:55,395][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:14:56,115][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:14:56,837][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:14:57,557][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:14:58,279][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:14:59,001][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:14:59,725][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:15:00,448][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:15:01,171][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:15:01,893][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:15:02,615][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:15:03,339][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:15:04,062][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:15:04,786][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:15:05,507][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:15:06,230][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:15:06,953][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:15:07,676][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:15:08,399][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:15:09,120][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:15:09,842][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:15:10,852][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:15:11,576][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:15:12,298][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:15:13,019][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:15:13,743][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:15:14,466][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:15:15,189][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:15:15,913][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:15:16,635][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:15:17,357][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:15:18,080][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:15:18,805][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:15:19,527][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:15:20,251][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:15:20,975][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:15:21,699][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:15:22,422][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:15:23,179][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 21:15:24,235][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:15:24,239][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:15:24,240][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:15:25,592][__main__][INFO] - Iteration 427 took 56s (10.85% Gen, 86.77% Train). Generation: 6s, Training: 49s. Estimated remaining time: 8h 58m 21s. Estimated total time: 15h 49m 12s. Time estimates for 10 more iterations: 9m 29s, 100 more iterations: 1h 34m 55s, 500 more iterations: 7h 54m 36s. [2026-03-25 21:15:25,595][__main__][INFO] - Starting iteration 427. [2026-03-25 21:15:25,599][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 21:15:25,600][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:15:30,655][__main__][INFO] - Number of regex retries in iteration 427: 0 [2026-03-25 21:15:30,656][__main__][INFO] - agents played in iteration 427 are Bob, Alice [2026-03-25 21:15:31,173][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:15:31,238][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:15:31,239][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:15:31,240][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:15:31,945][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:15:32,596][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:15:33,319][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:15:34,039][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:15:34,761][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:15:35,482][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:15:36,202][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:15:36,922][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:15:37,644][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:15:38,367][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:15:39,087][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:15:39,809][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:15:40,530][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:15:41,252][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:15:41,973][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:15:42,694][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:15:43,415][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:15:44,137][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:15:44,858][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:15:45,581][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:15:46,303][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:15:47,024][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:15:47,747][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:15:48,468][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:15:49,190][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:15:49,914][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:15:50,638][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:15:51,361][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:15:52,085][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:15:52,806][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:15:53,529][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:15:54,251][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:15:54,975][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:15:55,699][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:15:56,422][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:15:57,143][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:15:57,866][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:15:58,588][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:15:59,311][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:16:00,032][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:16:00,752][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:16:01,475][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:16:02,197][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:16:02,918][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:16:03,638][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:16:04,361][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:16:05,084][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:16:05,806][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:16:06,753][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:16:07,481][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:16:08,205][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:16:08,927][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:16:09,652][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:16:10,376][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:16:11,099][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:16:11,824][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:16:12,547][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:16:13,272][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:16:13,995][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:16:14,720][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:16:15,443][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:16:16,167][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:16:16,893][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:16:17,618][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:16:18,342][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:16:19,086][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 21:16:20,187][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:16:20,190][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:16:20,192][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:16:21,667][__main__][INFO] - Iteration 428 took 56s (9.02% Gen, 88.35% Train). Generation: 5s, Training: 49s. Estimated remaining time: 8h 42m 42s. Estimated total time: 15h 34m 30s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 27s, 500 more iterations: 7h 47m 15s. [2026-03-25 21:16:21,670][__main__][INFO] - Starting iteration 428. [2026-03-25 21:16:21,674][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 21:16:21,675][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:16:26,706][__main__][INFO] - Number of regex retries in iteration 428: 0 [2026-03-25 21:16:26,707][__main__][INFO] - agents played in iteration 428 are Bob, Alice [2026-03-25 21:16:27,202][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:16:27,269][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:16:27,270][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:16:27,271][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:16:27,954][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:16:28,603][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:16:29,327][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:16:30,045][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:16:30,767][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:16:31,485][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:16:32,207][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:16:32,927][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:16:33,647][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:16:34,368][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:16:35,088][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:16:35,807][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:16:36,530][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:16:37,251][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:16:37,972][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:16:38,693][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:16:39,415][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:16:40,137][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:16:40,858][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:16:41,578][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:16:42,300][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:16:43,020][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:16:43,741][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:16:44,462][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:16:45,183][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:16:45,903][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:16:46,624][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:16:47,345][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:16:48,066][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:16:48,786][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:16:49,507][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:16:50,229][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:16:50,951][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:16:51,671][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:16:52,393][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:16:53,114][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:16:53,836][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:16:54,556][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:16:55,279][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:16:56,000][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:16:56,722][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:16:57,443][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:16:58,165][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:16:58,886][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:16:59,608][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:17:00,328][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:17:01,050][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:17:01,772][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:17:02,744][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:17:03,467][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:17:04,188][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:17:04,911][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:17:05,631][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:17:06,354][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:17:07,075][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:17:07,799][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:17:08,522][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:17:09,245][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:17:09,967][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:17:10,690][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:17:11,412][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:17:12,136][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:17:12,858][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:17:13,581][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:17:14,303][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:17:15,030][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 21:17:16,349][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:17:16,353][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:17:16,355][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:17:17,729][__main__][INFO] - Iteration 429 took 56s (8.98% Gen, 88.57% Train). Generation: 5s, Training: 49s. Estimated remaining time: 8h 41m 33s. Estimated total time: 15h 34m 16s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 25s, 500 more iterations: 7h 47m 8s. [2026-03-25 21:17:17,733][__main__][INFO] - Starting iteration 429. [2026-03-25 21:17:17,737][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 21:17:17,738][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:17:22,713][__main__][INFO] - Number of regex retries in iteration 429: 0 [2026-03-25 21:17:22,714][__main__][INFO] - agents played in iteration 429 are Bob, Alice [2026-03-25 21:17:23,299][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:17:23,365][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:17:23,366][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:17:23,366][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:17:24,046][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:17:24,695][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:17:25,417][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:17:26,135][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:17:26,856][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:17:27,578][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:17:28,297][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:17:29,017][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:17:29,739][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:17:30,457][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:17:31,177][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:17:31,899][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:17:32,618][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:17:33,338][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:17:34,058][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:17:34,779][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:17:35,499][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:17:36,219][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:17:36,939][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:17:37,661][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:17:38,381][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:17:39,103][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:17:39,825][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:17:40,546][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:17:41,267][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:17:41,988][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:17:42,709][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:17:43,429][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:17:44,151][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:17:44,875][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:17:45,596][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:17:46,316][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:17:47,039][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:17:47,761][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:17:48,483][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:17:49,206][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:17:49,926][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:17:50,646][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:17:51,370][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:17:52,090][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:17:52,812][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:17:53,533][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:17:54,256][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:17:54,978][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:17:55,701][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:17:56,421][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:17:57,144][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:17:57,867][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:17:58,868][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:17:59,591][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:18:00,311][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:18:01,034][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:18:01,759][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:18:02,480][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:18:03,203][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:18:03,925][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:18:04,647][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:18:05,371][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:18:06,093][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:18:06,816][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:18:07,538][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:18:08,262][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:18:08,986][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:18:09,710][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:18:10,433][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:18:11,166][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 21:18:12,425][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:18:12,429][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:18:12,431][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:18:13,769][__main__][INFO] - Iteration 430 took 56s (8.88% Gen, 88.73% Train). Generation: 4s, Training: 49s. Estimated remaining time: 8h 40m 14s. Estimated total time: 15h 33m 53s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 23s, 500 more iterations: 7h 46m 56s. [2026-03-25 21:18:13,771][__main__][INFO] - Starting iteration 430. [2026-03-25 21:18:13,775][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 21:18:13,776][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:18:25,247][__main__][INFO] - Number of regex retries in iteration 430: 0 [2026-03-25 21:18:25,248][__main__][INFO] - agents played in iteration 430 are Bob, Alice [2026-03-25 21:18:25,791][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:18:25,855][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:18:25,856][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:18:25,857][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:18:26,544][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:18:27,190][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:18:27,910][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:18:28,627][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:18:29,345][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:18:30,063][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:18:30,780][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:18:31,498][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:18:32,215][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:18:32,932][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:18:33,652][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:18:34,371][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:18:35,087][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:18:35,808][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:18:36,525][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:18:37,243][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:18:37,962][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:18:38,681][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:18:39,400][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:18:40,118][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:18:40,838][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:18:41,557][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:18:42,275][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:18:42,995][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:18:43,713][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:18:44,433][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:18:45,152][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:18:45,873][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:18:46,592][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:18:47,311][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:18:48,033][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:18:48,752][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:18:49,473][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:18:50,192][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:18:50,913][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:18:51,633][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:18:52,355][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:18:53,075][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:18:53,794][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:18:54,516][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:18:55,236][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:18:55,958][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:18:56,679][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:18:57,400][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:18:58,120][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:18:58,842][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:18:59,564][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:19:00,286][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:19:01,241][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:19:01,965][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:19:02,685][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:19:03,407][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:19:04,128][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:19:04,849][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:19:05,572][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:19:06,293][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:19:07,015][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:19:07,736][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:19:08,459][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:19:09,183][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:19:09,905][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:19:10,626][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:19:11,349][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:19:12,070][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:19:12,793][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:19:13,520][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 21:19:14,590][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:19:14,594][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:19:14,595][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:19:16,023][__main__][INFO] - Iteration 431 took 1m 2s (18.43% Gen, 79.27% Train). Generation: 11s, Training: 49s. Estimated remaining time: 10h 22m 47s. Estimated total time: 17h 17m 28s. Time estimates for 10 more iterations: 10m 22s, 100 more iterations: 1h 43m 44s, 500 more iterations: 8h 38m 44s. [2026-03-25 21:19:16,025][__main__][INFO] - Starting iteration 431. [2026-03-25 21:19:16,032][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 21:19:16,033][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:19:20,899][__main__][INFO] - Number of regex retries in iteration 431: 0 [2026-03-25 21:19:20,901][__main__][INFO] - agents played in iteration 431 are Bob, Alice [2026-03-25 21:19:21,398][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:19:21,462][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:19:21,462][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:19:21,463][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:19:22,151][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:19:22,801][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:19:23,523][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:19:24,242][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:19:24,961][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:19:25,681][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:19:26,401][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:19:27,121][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:19:27,841][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:19:28,561][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:19:29,280][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:19:30,001][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:19:30,721][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:19:31,439][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:19:32,160][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:19:32,881][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:19:33,600][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:19:34,320][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:19:35,042][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:19:35,761][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:19:36,483][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:19:37,204][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:19:37,926][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:19:38,648][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:19:39,369][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:19:40,091][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:19:40,812][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:19:41,533][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:19:42,255][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:19:42,977][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:19:43,697][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:19:44,419][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:19:45,141][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:19:45,862][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:19:46,584][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:19:47,306][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:19:48,029][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:19:48,750][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:19:49,472][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:19:50,196][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:19:50,916][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:19:51,639][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:19:52,363][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:19:53,086][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:19:53,808][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:19:54,532][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:19:55,255][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:19:55,978][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:19:56,943][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:19:57,665][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:19:58,388][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:19:59,112][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:19:59,835][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:20:00,559][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:20:01,285][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:20:02,008][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:20:02,729][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:20:03,453][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:20:04,176][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:20:04,901][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:20:05,626][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:20:06,350][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:20:07,075][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:20:07,798][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:20:08,520][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:20:09,267][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 21:20:10,354][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:20:10,359][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:20:10,361][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:20:11,724][__main__][INFO] - Iteration 432 took 55s (8.74% Gen, 88.81% Train). Generation: 4s, Training: 49s. Estimated remaining time: 8h 32m 35s. Estimated total time: 15h 28m 13s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 49s, 500 more iterations: 7h 44m 6s. [2026-03-25 21:20:11,727][__main__][INFO] - Starting iteration 432. [2026-03-25 21:20:11,734][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 21:20:11,734][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:20:14,194][mllm.models.large_language_model_local][WARNING] - Response %A> did not match regex: (|), retry 1/1 [2026-03-25 21:20:17,389][__main__][INFO] - Number of regex retries in iteration 432: 1 [2026-03-25 21:20:17,390][__main__][INFO] - agents played in iteration 432 are Bob, Alice [2026-03-25 21:20:17,894][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:20:17,959][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:20:17,960][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:20:17,960][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2026-03-25 21:20:18,655][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:20:19,305][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:20:20,027][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:20:20,747][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:20:21,467][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:20:22,188][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:20:22,908][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:20:23,628][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:20:24,348][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:20:25,069][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:20:25,790][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 21:20:26,512][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:20:27,234][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:20:27,956][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:20:28,678][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:20:29,399][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:20:30,121][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:20:30,843][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:20:31,565][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:20:32,288][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:20:33,008][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 256 [2026-03-25 21:20:33,729][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:20:34,451][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:20:35,174][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:20:35,895][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:20:36,616][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:20:37,339][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:20:38,061][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:20:38,786][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:20:39,506][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:20:40,230][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:20:40,953][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 21:20:41,676][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:20:42,398][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:20:43,121][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:20:43,843][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:20:44,566][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:20:45,290][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:20:46,013][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:20:46,736][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:20:47,458][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 
21:20:48,179][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:20:48,903][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:20:49,627][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:20:50,349][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:20:51,071][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:20:51,796][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:20:52,519][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:20:53,518][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:20:54,241][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:20:54,963][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:20:55,689][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 21:20:56,412][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:20:57,134][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:20:57,858][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:20:58,581][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:20:59,305][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:21:00,028][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:21:00,750][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:21:01,475][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:21:02,199][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:21:02,922][mllm.training.trainer_common][INFO] - 
Processing mini-batch 244 of 256 [2026-03-25 21:21:03,648][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:21:04,371][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:21:05,094][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 21:21:05,826][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 21:21:06,986][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:21:06,989][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:21:06,994][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:21:08,642][__main__][INFO] - Iteration 433 took 56s (9.94% Gen, 87.16% Train). Generation: 5s, Training: 49s. Estimated remaining time: 8h 51m 57s. Estimated total time: 15h 48m 31s. Time estimates for 10 more iterations: 9m 29s, 100 more iterations: 1h 34m 51s, 500 more iterations: 7h 54m 15s. [2026-03-25 21:21:08,645][__main__][INFO] - Starting iteration 433. [2026-03-25 21:21:08,650][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 21:21:08,650][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:21:14,111][__main__][INFO] - Number of regex retries in iteration 433: 0 [2026-03-25 21:21:14,112][__main__][INFO] - agents played in iteration 433 are Bob, Alice [2026-03-25 21:21:14,652][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:21:14,724][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:21:14,725][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:21:14,726][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:21:15,427][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:21:16,077][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:21:16,800][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:21:17,521][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:21:18,241][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:21:18,959][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:21:19,681][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:21:20,402][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:21:21,124][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:21:21,844][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:21:22,565][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:21:23,287][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:21:24,007][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:21:24,728][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:21:25,448][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:21:26,170][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:21:26,892][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:21:27,612][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:21:28,333][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:21:29,057][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:21:29,779][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:21:30,500][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:21:31,222][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:21:31,944][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:21:32,667][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:21:33,390][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:21:34,111][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:21:34,832][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:21:35,555][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:21:36,277][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:21:37,000][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:21:37,723][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:21:38,444][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:21:39,169][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:21:39,891][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:21:40,615][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:21:41,340][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:21:42,062][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:21:42,785][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:21:43,507][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:21:44,230][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:21:44,953][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:21:45,677][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:21:46,400][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:21:47,122][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:21:47,845][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:21:48,571][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:21:49,295][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:21:50,248][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:21:50,974][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:21:51,697][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:21:52,421][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:21:53,143][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:21:53,867][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:21:54,592][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:21:55,315][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:21:56,040][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:21:56,764][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:21:57,486][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:21:58,210][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:21:58,934][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:21:59,658][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:22:00,384][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:22:01,108][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:22:01,832][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:22:02,560][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 21:22:03,639][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:22:03,643][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:22:03,644][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:22:05,091][__main__][INFO] - Iteration 434 took 56s (9.68% Gen, 87.75% Train). Generation: 5s, Training: 49s. Estimated remaining time: 8h 43m 13s. Estimated total time: 15h 40m 44s. Time estimates for 10 more iterations: 9m 24s, 100 more iterations: 1h 34m 4s, 500 more iterations: 7h 50m 22s. [2026-03-25 21:22:05,094][__main__][INFO] - Starting iteration 434. [2026-03-25 21:22:05,097][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 21:22:05,098][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:22:10,074][__main__][INFO] - Number of regex retries in iteration 434: 0 [2026-03-25 21:22:10,076][__main__][INFO] - agents played in iteration 434 are Bob, Alice [2026-03-25 21:22:10,569][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:22:10,633][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:22:10,634][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:22:10,635][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:22:11,319][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:22:11,969][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:22:12,693][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:22:13,413][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:22:14,135][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:22:14,854][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:22:15,574][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:22:16,297][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:22:17,018][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:22:17,739][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:22:18,460][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:22:19,181][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:22:19,904][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:22:20,626][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:22:21,347][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:22:22,068][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:22:22,790][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:22:23,513][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:22:24,236][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:22:24,956][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:22:25,679][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:22:26,402][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:22:27,126][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:22:27,848][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:22:28,569][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:22:29,293][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:22:30,016][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:22:30,740][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:22:31,463][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:22:32,189][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:22:32,911][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:22:33,634][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:22:34,357][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:22:35,080][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:22:35,804][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:22:36,528][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:22:37,251][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:22:37,976][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:22:38,699][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:22:39,423][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:22:40,146][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:22:40,872][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:22:41,597][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:22:42,322][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:22:43,047][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:22:43,771][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:22:44,495][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:22:45,218][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:22:46,172][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:22:46,897][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:22:47,619][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:22:48,343][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:22:49,066][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:22:49,791][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:22:50,516][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:22:51,240][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:22:51,965][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:22:52,691][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:22:53,415][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:22:54,141][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:22:54,865][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:22:55,588][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:22:56,314][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:22:57,037][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:22:57,762][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:22:58,506][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 21:22:59,555][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:22:59,558][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:22:59,559][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:23:01,078][__main__][INFO] - Iteration 435 took 55s (8.89% Gen, 88.39% Train). Generation: 4s, Training: 49s. Estimated remaining time: 8h 34m 35s. Estimated total time: 15h 33m 2s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 18s, 500 more iterations: 7h 46m 31s. [2026-03-25 21:23:01,081][__main__][INFO] - Starting iteration 435. [2026-03-25 21:23:01,085][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 21:23:01,086][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:23:06,078][__main__][INFO] - Number of regex retries in iteration 435: 0 [2026-03-25 21:23:06,080][__main__][INFO] - agents played in iteration 435 are Bob, Alice [2026-03-25 21:23:06,579][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:23:06,644][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:23:06,645][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:23:06,645][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:23:07,342][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:23:07,993][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:23:08,716][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:23:09,438][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:23:10,159][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:23:10,880][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:23:11,601][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:23:12,322][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:23:13,043][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:23:13,765][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:23:14,490][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:23:15,210][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:23:15,933][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:23:16,656][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:23:17,378][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:23:18,104][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:23:18,826][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:23:19,550][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:23:20,272][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:23:20,996][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:23:21,720][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:23:22,441][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:23:23,163][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:23:23,887][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:23:24,609][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:23:25,333][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:23:26,057][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:23:26,780][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:23:27,504][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:23:28,227][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:23:28,949][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:23:29,673][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:23:30,398][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:23:31,121][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:23:31,844][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:23:32,567][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:23:33,291][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:23:34,015][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:23:34,738][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:23:35,463][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:23:36,188][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:23:36,912][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:23:37,637][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:23:38,359][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:23:39,084][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:23:39,808][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:23:40,532][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:23:41,256][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:23:42,289][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:23:43,015][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:23:43,738][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:23:44,463][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:23:45,190][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:23:45,914][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:23:46,641][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:23:47,365][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:23:48,089][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:23:48,813][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:23:49,538][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:23:50,262][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:23:50,986][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:23:51,712][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:23:52,436][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:23:53,161][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:23:53,886][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:23:54,620][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 21:23:55,915][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:23:55,919][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:23:55,921][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:23:57,285][__main__][INFO] - Iteration 436 took 56s (8.88% Gen, 88.68% Train). Generation: 4s, Training: 49s. Estimated remaining time: 8h 37m 19s. Estimated total time: 15h 36m 42s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 40s, 500 more iterations: 7h 48m 21s. [2026-03-25 21:23:57,288][__main__][INFO] - Starting iteration 436. [2026-03-25 21:23:57,292][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 21:23:57,293][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:24:02,237][__main__][INFO] - Number of regex retries in iteration 436: 0 [2026-03-25 21:24:02,238][__main__][INFO] - agents played in iteration 436 are Bob, Alice [2026-03-25 21:24:02,755][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:24:02,820][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:24:02,822][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:24:02,822][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:24:03,511][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:24:04,163][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:24:04,886][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:24:05,606][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:24:06,327][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:24:07,050][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:24:07,771][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:24:08,493][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:24:09,218][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:24:09,940][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:24:10,663][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:24:11,384][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:24:12,105][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:24:12,830][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:24:13,586][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:24:14,300][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:24:15,022][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:24:15,743][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:24:16,467][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:24:17,190][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:24:17,915][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:24:18,637][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:24:19,360][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:24:20,083][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:24:20,807][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:24:21,529][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:24:22,252][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:24:22,976][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:24:23,700][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:24:24,424][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:24:25,148][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:24:25,872][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:24:26,596][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:24:27,319][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:24:28,045][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:24:28,769][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:24:29,491][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:24:30,214][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:24:30,937][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:24:31,662][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:24:32,386][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:24:33,111][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:24:33,836][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:24:34,559][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:24:35,282][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:24:36,006][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:24:36,730][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:24:37,455][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:24:38,410][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:24:39,134][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:24:39,857][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:24:40,582][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:24:41,307][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:24:42,034][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:24:42,759][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:24:43,484][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:24:44,210][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:24:44,935][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:24:45,661][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:24:46,387][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:24:47,111][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:24:47,836][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:24:48,561][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:24:49,285][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:24:50,010][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:24:50,753][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 21:24:52,021][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:24:52,025][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:24:52,027][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:24:53,459][__main__][INFO] - Iteration 437 took 56s (8.80% Gen, 88.64% Train). Generation: 4s, Training: 49s. Estimated remaining time: 8h 35m 49s. Estimated total time: 15h 36m 8s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 36s, 500 more iterations: 7h 48m 4s. [2026-03-25 21:24:53,465][__main__][INFO] - Starting iteration 437. [2026-03-25 21:24:53,474][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 21:24:53,475][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:24:58,454][__main__][INFO] - Number of regex retries in iteration 437: 0 [2026-03-25 21:24:58,455][__main__][INFO] - agents played in iteration 437 are Bob, Alice [2026-03-25 21:24:59,050][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:24:59,115][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:24:59,116][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:24:59,117][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:24:59,801][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:25:00,453][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:25:01,176][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:25:01,897][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:25:02,617][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:25:03,340][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:25:04,062][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:25:04,783][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:25:05,504][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:25:06,226][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:25:06,949][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:25:07,670][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:25:08,393][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:25:09,116][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:25:09,841][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:25:10,563][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:25:11,285][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:25:12,008][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:25:12,732][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:25:13,456][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:25:14,178][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:25:14,900][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:25:15,623][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:25:16,348][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:25:17,071][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:25:17,796][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:25:18,518][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:25:19,241][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:25:19,964][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:25:20,689][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:25:21,413][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:25:22,137][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:25:22,860][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:25:23,585][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:25:24,308][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:25:25,032][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:25:25,757][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:25:26,482][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:25:27,205][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:25:27,929][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:25:28,653][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:25:29,378][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:25:30,103][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:25:30,828][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:25:31,550][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:25:32,276][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:25:32,999][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:25:33,724][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:25:34,679][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:25:35,405][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:25:36,130][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:25:36,855][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:25:37,579][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:25:38,305][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:25:39,031][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:25:39,757][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:25:40,482][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:25:41,207][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:25:41,931][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:25:42,656][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:25:43,379][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:25:44,104][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:25:44,830][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:25:45,555][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:25:46,280][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:25:47,014][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 21:25:48,058][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:25:48,062][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:25:48,064][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:25:49,438][__main__][INFO] - Iteration 438 took 55s (8.90% Gen, 88.64% Train). Generation: 4s, Training: 49s. Estimated remaining time: 8h 31m 30s. Estimated total time: 15h 32m 45s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 16s, 500 more iterations: 7h 46m 22s. [2026-03-25 21:25:49,442][__main__][INFO] - Starting iteration 438. [2026-03-25 21:25:49,467][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 21:25:49,467][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:25:54,975][__main__][INFO] - Number of regex retries in iteration 438: 0 [2026-03-25 21:25:54,976][__main__][INFO] - agents played in iteration 438 are Bob, Alice [2026-03-25 21:25:55,485][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:25:55,551][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:25:55,552][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:25:55,553][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:25:56,272][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:25:58,616][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:25:59,338][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:26:00,061][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:26:00,784][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:26:03,717][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:26:04,439][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:26:05,159][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:26:05,879][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:26:06,601][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:26:07,322][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:26:08,044][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:26:08,764][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:26:09,485][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:26:10,206][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:26:10,927][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:26:11,648][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:26:12,371][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:26:13,093][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:26:13,818][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:26:14,540][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:26:15,260][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:26:15,984][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:26:16,707][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:26:17,430][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:26:18,155][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:26:18,881][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:26:19,601][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:26:20,324][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:26:21,049][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:26:21,771][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:26:22,495][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:26:23,218][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:26:23,942][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:26:24,666][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:26:25,392][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:26:26,117][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:26:26,843][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:26:27,568][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:26:28,292][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:26:29,017][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:26:29,740][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:26:30,464][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:26:31,189][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:26:31,915][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:26:32,639][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:26:33,362][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:26:34,086][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:26:35,143][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:26:35,867][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:26:36,592][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:26:37,315][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:26:38,039][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:26:38,764][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:26:39,488][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:26:40,212][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:26:40,935][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:26:41,660][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:26:42,386][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:26:43,110][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:26:43,833][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:26:44,557][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:26:45,278][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:26:46,003][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:26:46,727][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:26:47,463][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:51 [2026-03-25 21:26:48,805][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:26:48,809][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:26:48,811][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:26:50,189][__main__][INFO] - Iteration 439 took 1m 0s (9.07% Gen, 88.66% Train). Generation: 5s, Training: 53s. Estimated remaining time: 9h 49m 48s. Estimated total time: 16h 52m 4s. Time estimates for 10 more iterations: 10m 7s, 100 more iterations: 1h 41m 12s, 500 more iterations: 8h 26m 2s. [2026-03-25 21:26:50,192][__main__][INFO] - Starting iteration 439. [2026-03-25 21:26:50,196][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 21:26:50,197][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:26:55,074][__main__][INFO] - Number of regex retries in iteration 439: 0 [2026-03-25 21:26:55,075][__main__][INFO] - agents played in iteration 439 are Bob, Alice [2026-03-25 21:26:55,576][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:26:55,640][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:26:55,641][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:26:55,641][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:26:56,351][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:26:57,000][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:26:57,724][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:26:58,443][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:26:59,165][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:26:59,889][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:27:00,609][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:27:01,331][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:27:02,053][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:27:02,775][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:27:03,500][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:27:04,223][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:27:04,945][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:27:05,666][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:27:06,389][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:27:07,111][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:27:07,834][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:27:08,556][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:27:09,278][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:27:10,000][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:27:10,724][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:27:11,447][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:27:12,169][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:27:12,890][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:27:13,613][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:27:14,341][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:27:15,066][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:27:15,790][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:27:16,514][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:27:17,237][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:27:17,959][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:27:18,682][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:27:19,405][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:27:20,130][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:27:20,853][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:27:21,576][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:27:22,299][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:27:23,022][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:27:23,746][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:27:24,470][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:27:25,195][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:27:25,918][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:27:26,641][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:27:27,364][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:27:28,088][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:27:28,813][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:27:29,540][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:27:30,265][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:27:31,213][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:27:31,939][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:27:32,662][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:27:33,386][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:27:34,110][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:27:34,834][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:27:35,559][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:27:36,285][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:27:37,008][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:27:37,737][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:27:38,461][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:27:39,184][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:27:39,910][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:27:40,634][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:27:41,359][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:27:42,084][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:27:42,808][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:27:43,526][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 21:27:44,569][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:27:44,572][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:27:44,573][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:27:46,128][__main__][INFO] - Iteration 440 took 55s (8.72% Gen, 88.49% Train). Generation: 4s, Training: 49s. Estimated remaining time: 8h 29m 1s. Estimated total time: 15h 32m 13s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 13s, 500 more iterations: 7h 46m 6s. [2026-03-25 21:27:46,130][__main__][INFO] - Starting iteration 440. [2026-03-25 21:27:46,135][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 21:27:46,135][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:27:51,047][__main__][INFO] - Number of regex retries in iteration 440: 0 [2026-03-25 21:27:51,048][__main__][INFO] - agents played in iteration 440 are Bob, Alice [2026-03-25 21:27:51,620][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:27:51,686][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:27:51,687][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:27:51,688][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:27:52,397][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:27:53,084][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:27:53,807][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:27:54,529][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:27:55,250][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:27:55,970][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:27:56,692][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:27:57,414][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:27:58,137][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:27:58,858][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:27:59,578][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:28:00,305][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:28:01,027][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:28:01,750][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:28:02,474][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:28:03,196][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:28:03,918][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:28:04,641][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:28:05,363][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:28:06,087][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:28:06,811][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:28:07,531][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:28:08,256][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:28:08,979][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:28:09,702][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:28:10,427][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:28:11,150][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:28:11,873][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:28:12,596][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:28:13,319][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:28:14,042][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:28:14,766][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:28:15,490][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:28:16,214][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:28:16,937][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:28:17,661][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:28:18,385][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:28:19,108][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:28:19,832][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:28:20,559][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:28:21,286][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:28:22,010][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:28:22,737][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:28:23,463][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:28:24,188][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:28:24,914][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:28:25,639][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:28:26,364][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:28:27,319][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:28:28,045][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:28:28,768][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:28:29,493][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:28:30,219][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:28:30,945][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:28:31,669][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:28:32,392][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:28:33,116][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:28:33,842][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:28:34,567][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:28:35,291][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:28:36,017][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:28:36,743][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:28:37,468][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:28:38,194][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:28:38,919][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:28:39,668][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 21:28:41,052][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:28:41,056][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:28:41,058][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:28:42,397][__main__][INFO] - Iteration 441 took 56s (8.73% Gen, 88.89% Train). Generation: 4s, Training: 50s. Estimated remaining time: 8h 33m 36s. Estimated total time: 15h 37m 44s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 46s, 500 more iterations: 7h 48m 52s. [2026-03-25 21:28:42,404][__main__][INFO] - Starting iteration 441. [2026-03-25 21:28:42,409][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 21:28:42,410][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:28:47,335][__main__][INFO] - Number of regex retries in iteration 441: 0 [2026-03-25 21:28:47,336][__main__][INFO] - agents played in iteration 441 are Bob, Alice [2026-03-25 21:28:47,850][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:28:47,916][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:28:47,916][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:28:47,917][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:28:48,639][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:28:49,291][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:28:50,015][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:28:50,735][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:28:51,457][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:28:52,179][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:28:52,900][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:28:53,624][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:28:54,347][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:28:55,069][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:28:55,792][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:28:56,513][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:28:57,236][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:28:57,961][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:28:58,684][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:28:59,407][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:29:00,129][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:29:00,852][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:29:01,575][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:29:02,299][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:29:03,023][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:29:03,747][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:29:04,469][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:29:05,192][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:29:05,916][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:29:06,638][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:29:07,364][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:29:08,088][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:29:08,813][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:29:09,536][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:29:10,258][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:29:10,983][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:29:11,709][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:29:12,433][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:29:13,158][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:29:13,882][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:29:14,606][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:29:15,332][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:29:16,057][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:29:16,780][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:29:17,504][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:29:18,228][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:29:18,951][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:29:19,676][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:29:20,401][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:29:21,126][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:29:21,852][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:29:22,577][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:29:23,612][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:29:24,340][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:29:25,064][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:29:25,789][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:29:26,518][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:29:27,242][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:29:27,968][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:29:28,695][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:29:29,420][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:29:30,149][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:29:30,875][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:29:31,601][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:29:32,328][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:29:33,055][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:29:33,779][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:29:34,507][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:29:35,231][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:29:35,998][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 21:29:38,014][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:29:38,018][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:29:38,019][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:29:39,342][__main__][INFO] - Iteration 442 took 56s (8.65% Gen, 89.02% Train). Generation: 4s, Training: 50s. Estimated remaining time: 8h 43m 49s. Estimated total time: 15h 48m 54s. Time estimates for 10 more iterations: 9m 29s, 100 more iterations: 1h 34m 53s, 500 more iterations: 7h 54m 27s. [2026-03-25 21:29:39,344][__main__][INFO] - Starting iteration 442. [2026-03-25 21:29:39,349][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 21:29:39,350][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:29:44,258][__main__][INFO] - Number of regex retries in iteration 442: 0 [2026-03-25 21:29:44,259][__main__][INFO] - agents played in iteration 442 are Bob, Alice [2026-03-25 21:29:44,752][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:29:44,816][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:29:44,817][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:29:44,817][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:29:45,508][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:29:46,158][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:29:46,882][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:29:47,603][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:29:48,323][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:29:49,046][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:29:49,768][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:29:50,490][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:29:51,211][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:29:51,934][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:29:52,657][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:29:53,380][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:29:54,102][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:29:54,824][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:29:55,549][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:29:56,271][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:29:56,993][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:29:57,716][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:29:58,441][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:29:59,163][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:29:59,886][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:30:00,608][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:30:01,333][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:30:02,055][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:30:02,777][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:30:03,500][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:30:04,223][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:30:04,947][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:30:05,672][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:30:06,393][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:30:07,117][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:30:07,840][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:30:08,564][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:30:09,289][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:30:10,013][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:30:10,737][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:30:11,461][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:30:12,184][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:30:12,907][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:30:13,632][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:30:14,355][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:30:15,080][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:30:15,804][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:30:16,528][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:30:17,253][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:30:17,977][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:30:18,702][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:30:19,426][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:30:20,383][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:30:21,110][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:30:21,836][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:30:22,561][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:30:23,286][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:30:24,012][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:30:24,737][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:30:25,460][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:30:26,184][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:30:26,909][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:30:27,633][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:30:28,357][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:30:29,082][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:30:29,807][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:30:30,532][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:30:31,258][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:30:31,984][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:30:32,711][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 21:30:33,910][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:30:33,913][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:30:33,916][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:30:35,319][__main__][INFO] - Iteration 443 took 55s (8.77% Gen, 88.72% Train). Generation: 4s, Training: 49s. Estimated remaining time: 8h 26m 51s. Estimated total time: 15h 32m 52s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 17s, 500 more iterations: 7h 46m 26s. [2026-03-25 21:30:35,322][__main__][INFO] - Starting iteration 443. [2026-03-25 21:30:35,327][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 21:30:35,327][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:30:40,275][__main__][INFO] - Number of regex retries in iteration 443: 0 [2026-03-25 21:30:40,276][__main__][INFO] - agents played in iteration 443 are Bob, Alice [2026-03-25 21:30:40,771][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:30:40,836][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:30:40,836][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:30:40,837][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:30:41,524][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:30:42,175][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:30:42,899][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:30:43,621][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:30:44,343][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:30:45,064][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:30:45,788][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:30:46,510][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:30:47,234][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:30:47,956][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:30:48,678][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:30:49,401][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:30:50,125][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:30:50,849][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:30:51,570][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:30:52,294][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:30:53,016][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:30:53,741][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:30:54,466][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:30:55,188][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:30:55,911][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:30:56,635][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:30:57,358][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:30:58,083][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:30:58,807][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:30:59,530][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:31:00,254][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:31:00,976][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:31:01,701][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:31:02,425][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:31:03,148][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:31:03,873][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:31:04,596][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:31:05,321][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:31:06,045][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:31:06,769][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:31:07,493][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:31:08,218][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:31:08,945][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:31:09,670][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:31:10,396][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:31:11,123][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:31:11,846][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:31:12,572][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:31:13,295][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:31:14,018][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:31:14,743][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:31:15,468][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:31:16,428][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:31:17,154][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:31:17,880][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:31:18,604][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:31:19,328][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:31:20,052][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:31:20,776][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:31:21,500][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:31:22,225][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:31:22,950][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:31:23,675][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:31:24,400][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:31:25,126][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:31:25,851][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:31:26,576][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:31:27,304][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:31:28,029][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:31:28,765][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 21:31:30,015][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:31:30,019][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:31:30,021][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:31:31,409][__main__][INFO] - Iteration 444 took 56s (8.82% Gen, 88.70% Train). Generation: 4s, Training: 49s. Estimated remaining time: 8h 27m 48s. Estimated total time: 15h 34m 45s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 28s, 500 more iterations: 7h 47m 22s. [2026-03-25 21:31:31,412][__main__][INFO] - Starting iteration 444. [2026-03-25 21:31:31,425][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 21:31:31,426][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:31:36,456][__main__][INFO] - Number of regex retries in iteration 444: 0 [2026-03-25 21:31:36,457][__main__][INFO] - agents played in iteration 444 are Bob, Alice [2026-03-25 21:31:37,043][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:31:37,109][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:31:37,110][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:31:37,110][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:31:37,795][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:31:38,445][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:31:39,170][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:31:39,893][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:31:40,614][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:31:41,335][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:31:42,057][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:31:42,780][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:31:43,502][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:31:44,224][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:31:44,947][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:31:45,669][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:31:46,394][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:31:47,116][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:31:47,839][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:31:48,561][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:31:49,284][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:31:50,006][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:31:50,730][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:31:51,455][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:31:52,179][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:31:52,903][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:31:53,627][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:31:54,352][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:31:55,076][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:31:55,801][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:31:56,527][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:31:57,251][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:31:57,975][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:31:58,699][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:31:59,421][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:32:00,146][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:32:00,870][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:32:01,595][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:32:02,318][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:32:03,041][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:32:03,766][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:32:04,490][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:32:05,214][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:32:05,937][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:32:06,662][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:32:07,386][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:32:08,110][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:32:08,835][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:32:09,570][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:32:10,295][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:32:11,019][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:32:11,744][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:32:12,778][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:32:13,504][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:32:14,228][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:32:14,952][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:32:15,676][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:32:16,400][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:32:17,124][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:32:17,848][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:32:18,587][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:32:19,310][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:32:20,035][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:32:20,759][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:32:21,484][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:32:22,211][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:32:22,936][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:32:23,661][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:32:24,386][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:32:25,130][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 21:32:26,389][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:32:26,392][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:32:26,394][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:32:27,827][__main__][INFO] - Iteration 445 took 56s (8.92% Gen, 88.54% Train). Generation: 5s, Training: 49s. Estimated remaining time: 8h 32m 10s. Estimated total time: 15h 40m 3s. Time estimates for 10 more iterations: 9m 24s, 100 more iterations: 1h 34m 0s, 500 more iterations: 7h 50m 1s. [2026-03-25 21:32:27,830][__main__][INFO] - Starting iteration 445. [2026-03-25 21:32:27,833][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 21:32:27,834][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:32:32,868][__main__][INFO] - Number of regex retries in iteration 445: 0 [2026-03-25 21:32:32,869][__main__][INFO] - agents played in iteration 445 are Bob, Alice [2026-03-25 21:32:33,408][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:32:33,472][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:32:33,473][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:32:33,474][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:32:34,158][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:32:34,808][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:32:35,532][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:32:36,254][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:32:36,974][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:32:37,698][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:32:38,418][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:32:39,141][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:32:39,863][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:32:40,584][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:32:41,305][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:32:42,026][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:32:42,747][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:32:43,469][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:32:44,190][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:32:44,911][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:32:45,634][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:32:46,357][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:32:47,079][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:32:47,800][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:32:48,520][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:32:49,243][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:32:49,966][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:32:50,688][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:32:51,410][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:32:52,133][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:32:52,855][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:32:53,578][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:32:54,302][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:32:55,023][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:32:55,746][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:32:56,468][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:32:57,191][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:32:57,915][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:32:58,637][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:32:59,359][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:33:00,081][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:33:00,807][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:33:01,531][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:33:02,254][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:33:02,977][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:33:03,699][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:33:04,422][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:33:05,146][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:33:05,871][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:33:06,594][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:33:07,316][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:33:08,039][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:33:08,998][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:33:09,721][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:33:10,444][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:33:11,167][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:33:11,891][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:33:12,615][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:33:13,337][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:33:14,060][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:33:14,783][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:33:15,508][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:33:16,233][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:33:16,956][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:33:17,679][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:33:18,403][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:33:19,127][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:33:19,851][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:33:20,575][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:33:21,302][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 21:33:22,375][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:33:22,378][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:33:22,382][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:33:23,853][__main__][INFO] - Iteration 446 took 56s (8.99% Gen, 88.38% Train). Generation: 5s, Training: 49s. Estimated remaining time: 8h 24m 51s. Estimated total time: 15h 33m 41s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 22s, 500 more iterations: 7h 46m 50s. [2026-03-25 21:33:23,857][__main__][INFO] - Starting iteration 446. [2026-03-25 21:33:23,863][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 21:33:23,865][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:33:29,307][__main__][INFO] - Number of regex retries in iteration 446: 0 [2026-03-25 21:33:29,308][__main__][INFO] - agents played in iteration 446 are Bob, Alice [2026-03-25 21:33:29,947][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:33:30,011][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:33:30,012][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:33:30,013][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:33:30,702][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:33:31,352][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:33:32,074][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:33:32,794][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:33:33,515][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:33:34,235][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:33:34,954][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:33:35,678][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:33:36,436][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:33:38,782][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:33:39,503][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:33:40,224][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:33:40,942][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:33:41,662][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:33:42,662][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:33:43,385][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:33:44,105][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:33:44,825][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:33:45,545][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:33:46,266][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:33:46,987][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:33:47,708][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:33:48,430][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:33:49,151][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:33:49,871][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:33:50,593][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:33:51,314][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:33:52,034][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:33:52,756][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:33:53,478][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:33:54,200][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:33:54,921][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:33:55,641][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:33:56,364][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:33:57,086][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:33:57,807][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:33:58,528][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:33:59,251][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:33:59,973][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:34:00,695][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:34:01,416][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:34:02,138][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:34:02,862][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:34:03,586][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:34:04,310][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:34:05,033][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:34:05,755][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:34:06,478][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:34:07,432][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:34:08,156][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:34:08,878][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:34:09,601][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:34:10,326][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:34:11,049][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:34:11,770][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:34:12,495][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:34:13,217][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:34:13,940][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:34:14,664][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:34:15,386][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:34:16,109][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:34:16,832][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:34:17,555][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:34:18,281][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:34:19,004][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:34:19,733][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:49 [2026-03-25 21:34:20,775][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:34:20,778][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:34:20,780][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:34:22,437][__main__][INFO] - Iteration 447 took 58s (9.29% Gen, 87.87% Train). Generation: 5s, Training: 51s. Estimated remaining time: 9h 6m 29s. Estimated total time: 16h 16m 17s. Time estimates for 10 more iterations: 9m 45s, 100 more iterations: 1h 37m 37s, 500 more iterations: 8h 8m 8s. [2026-03-25 21:34:22,440][__main__][INFO] - Starting iteration 447. [2026-03-25 21:34:22,446][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 21:34:22,447][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:34:27,440][__main__][INFO] - Number of regex retries in iteration 447: 0 [2026-03-25 21:34:27,441][__main__][INFO] - agents played in iteration 447 are Bob, Alice [2026-03-25 21:34:27,941][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:34:28,006][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:34:28,006][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:34:28,007][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:34:28,690][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:34:29,340][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:34:30,062][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:34:30,782][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:34:31,501][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:34:32,220][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:34:32,940][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:34:33,659][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:34:34,379][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:34:35,101][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:34:35,820][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:34:36,540][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:34:37,262][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:34:37,984][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:34:38,705][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:34:39,427][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:34:40,149][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:34:40,871][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:34:41,592][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:34:42,312][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:34:43,033][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:34:43,755][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:34:44,475][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:34:45,197][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:34:45,919][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:34:46,640][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:34:47,363][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:34:48,084][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:34:48,805][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:34:49,527][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:34:50,251][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:34:50,971][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:34:51,693][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:34:52,416][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:34:53,139][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:34:53,863][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:34:54,586][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:34:55,306][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:34:56,030][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:34:56,752][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:34:57,476][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:34:58,198][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:34:58,920][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:34:59,641][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:35:00,364][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:35:01,089][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:35:01,811][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:35:02,534][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:35:03,565][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:35:04,289][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:35:05,013][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:35:05,735][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:35:06,458][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:35:07,182][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:35:07,906][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:35:08,627][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:35:09,352][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:35:10,075][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:35:10,798][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:35:11,522][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:35:12,246][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:35:12,970][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:35:13,692][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:35:14,414][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:35:15,138][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:35:15,905][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 21:35:17,008][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:35:17,012][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:35:17,013][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:35:18,761][__main__][INFO] - Iteration 448 took 56s (8.87% Gen, 88.02% Train). Generation: 4s, Training: 49s. Estimated remaining time: 8h 27m 53s. Estimated total time: 15h 38m 37s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 51s, 500 more iterations: 7h 49m 18s. [2026-03-25 21:35:18,766][__main__][INFO] - Starting iteration 448. [2026-03-25 21:35:18,772][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 21:35:18,774][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:35:23,841][__main__][INFO] - Number of regex retries in iteration 448: 0 [2026-03-25 21:35:23,842][__main__][INFO] - agents played in iteration 448 are Bob, Alice [2026-03-25 21:35:24,345][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:35:24,410][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:35:24,411][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:35:24,412][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:35:25,098][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:35:25,748][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:35:26,470][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:35:27,188][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:35:27,909][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:35:28,627][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:35:29,349][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:35:30,071][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:35:30,790][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:35:31,512][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:35:32,232][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:35:32,951][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:35:33,672][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:35:34,393][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:35:35,113][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:35:35,835][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:35:36,554][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:35:37,275][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:35:37,996][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:35:38,717][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:35:39,437][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:35:40,159][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:35:40,881][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:35:41,602][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:35:42,323][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:35:43,044][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:35:43,766][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:35:44,487][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:35:45,208][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:35:45,931][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:35:46,653][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:35:47,376][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:35:48,097][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:35:48,819][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:35:49,541][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:35:50,265][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:35:50,988][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:35:51,708][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:35:52,432][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:35:53,154][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:35:53,878][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:35:54,601][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:35:55,323][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:35:56,045][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:35:56,768][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:35:57,491][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:35:58,214][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:35:58,937][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:35:59,891][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:36:00,618][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:36:01,340][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:36:02,064][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:36:02,786][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:36:03,509][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:36:04,231][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:36:04,955][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:36:05,679][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:36:06,403][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:36:07,125][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:36:07,849][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:36:08,572][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:36:09,296][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:36:10,019][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:36:10,744][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:36:11,468][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:36:12,202][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 21:36:13,360][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:36:13,364][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:36:13,366][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:36:14,773][__main__][INFO] - Iteration 449 took 56s (9.05% Gen, 88.43% Train). Generation: 5s, Training: 49s. Estimated remaining time: 8h 21m 43s. Estimated total time: 15h 33m 24s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 20s, 500 more iterations: 7h 46m 42s. [2026-03-25 21:36:14,777][__main__][INFO] - Starting iteration 449. [2026-03-25 21:36:14,783][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 21:36:14,785][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:36:19,862][__main__][INFO] - Number of regex retries in iteration 449: 0 [2026-03-25 21:36:19,863][__main__][INFO] - agents played in iteration 449 are Bob, Alice [2026-03-25 21:36:20,384][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:36:20,448][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:36:20,449][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:36:20,449][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:36:21,131][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:36:21,779][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:36:22,502][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:36:23,222][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:36:23,941][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:36:24,662][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:36:25,382][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:36:26,102][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:36:26,822][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:36:27,542][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:36:28,263][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:36:28,985][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:36:29,704][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:36:30,424][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:36:31,146][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:36:31,865][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:36:32,586][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:36:33,308][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:36:34,029][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:36:34,749][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:36:35,471][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:36:36,192][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:36:36,912][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:36:37,634][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:36:38,355][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:36:39,077][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:36:39,799][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:36:40,521][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:36:41,241][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:36:41,962][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:36:42,686][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:36:43,407][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:36:44,128][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:36:44,851][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:36:45,574][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:36:46,294][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:36:47,017][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:36:47,739][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:36:48,462][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:36:49,184][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:36:49,905][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:36:50,627][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:36:51,351][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:36:52,072][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:36:52,795][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:36:53,516][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:36:54,239][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:36:54,965][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:36:55,912][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:36:56,638][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:36:57,361][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:36:58,084][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:36:58,806][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:36:59,529][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:37:00,252][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:37:00,976][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:37:01,700][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:37:02,422][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:37:03,145][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:37:03,868][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:37:04,591][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:37:05,314][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:37:06,039][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:37:06,763][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:37:07,485][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:37:08,220][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 21:37:09,422][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:37:09,426][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:37:09,428][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:37:10,864][__main__][INFO] - Iteration 450 took 56s (9.05% Gen, 88.38% Train). Generation: 5s, Training: 49s. Estimated remaining time: 8h 22m 6s. Estimated total time: 15h 34m 43s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 28s, 500 more iterations: 7h 47m 21s. [2026-03-25 21:37:10,867][__main__][INFO] - Starting iteration 450. [2026-03-25 21:37:10,871][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2026-03-25 21:37:10,872][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:37:15,742][__main__][INFO] - Number of regex retries in iteration 450: 0 [2026-03-25 21:37:15,743][__main__][INFO] - agents played in iteration 450 are Bob, Alice [2026-03-25 21:37:16,235][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:37:16,300][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:37:16,301][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:37:16,302][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:37:16,991][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:37:17,641][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:37:18,362][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:37:19,080][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:37:19,800][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:37:20,521][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:37:21,240][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:37:21,959][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:37:22,680][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:37:23,400][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:37:24,119][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:37:24,840][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:37:25,562][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:37:26,281][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:37:27,001][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:37:27,723][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:37:28,443][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:37:29,163][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:37:29,883][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:37:30,605][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:37:31,324][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:37:32,045][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:37:32,769][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:37:33,489][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:37:34,209][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:37:34,931][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:37:35,653][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:37:36,376][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:37:37,097][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:37:37,822][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:37:38,545][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:37:39,267][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:37:39,990][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:37:40,710][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:37:41,432][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:37:42,153][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:37:42,874][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:37:43,594][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:37:44,318][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:37:45,039][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:37:45,759][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:37:46,483][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:37:47,204][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:37:47,924][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:37:48,647][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:37:49,368][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:37:50,090][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:37:50,812][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:37:51,844][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:37:52,568][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:37:53,290][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:37:54,011][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:37:54,734][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:37:55,455][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:37:56,177][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:37:56,900][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:37:57,623][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:37:58,345][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:37:59,069][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:37:59,791][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:38:00,514][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:38:01,236][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:38:01,958][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:38:02,680][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:38:03,402][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:38:04,158][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 21:38:05,207][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:38:05,211][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:38:05,212][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:38:07,915][__main__][INFO] - Iteration 451 took 57s (8.54% Gen, 86.72% Train). Generation: 4s, Training: 49s. Estimated remaining time: 8h 37m 12s. Estimated total time: 15h 50m 46s. Time estimates for 10 more iterations: 9m 30s, 100 more iterations: 1h 35m 4s, 500 more iterations: 7h 55m 23s. [2026-03-25 21:38:07,919][__main__][INFO] - Starting iteration 451. [2026-03-25 21:38:07,927][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:38:07,929][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:38:15,045][__main__][INFO] - Number of regex retries in iteration 451: 0 [2026-03-25 21:38:15,046][__main__][INFO] - agents played in iteration 451 are Bob, Alice [2026-03-25 21:38:15,578][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:38:15,644][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:38:15,645][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:38:15,646][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:38:16,335][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:38:16,983][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:38:17,703][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:38:18,421][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:38:19,139][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:38:19,856][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:38:20,574][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:38:21,292][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:38:22,010][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:38:22,728][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:38:23,446][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:38:24,164][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:38:24,882][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:38:25,601][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:38:26,319][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:38:27,038][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:38:27,756][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:38:28,474][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:38:29,193][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:38:29,912][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:38:30,632][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:38:31,350][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:38:32,069][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:38:32,791][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:38:33,509][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:38:34,230][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:38:34,954][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:38:35,673][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:38:36,393][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:38:37,115][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:38:37,835][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:38:38,556][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:38:39,280][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:38:40,002][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:38:40,723][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:38:41,442][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:38:42,162][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:38:42,882][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:38:43,600][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:38:44,321][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:38:45,042][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:38:45,760][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:38:46,483][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:38:47,202][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:38:47,922][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:38:48,642][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:38:49,363][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:38:50,082][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:38:51,031][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:38:51,753][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:38:52,473][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:38:53,192][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:38:53,914][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:38:54,634][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:38:55,354][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:38:56,075][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:38:56,794][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:38:57,517][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:38:58,237][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:38:58,959][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:38:59,680][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:39:00,402][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:39:01,122][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:39:01,845][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:39:02,569][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:39:03,297][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 21:39:04,544][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:39:04,548][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:39:04,550][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:39:05,926][__main__][INFO] - Iteration 452 took 58s (12.27% Gen, 85.35% Train). Generation: 7s, Training: 49s. Estimated remaining time: 8h 52m 9s. Estimated total time: 16h 6m 41s. Time estimates for 10 more iterations: 9m 40s, 100 more iterations: 1h 36m 40s, 500 more iterations: 8h 3m 20s. [2026-03-25 21:39:05,929][__main__][INFO] - Starting iteration 452. [2026-03-25 21:39:05,934][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:39:05,934][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:39:11,261][__main__][INFO] - Number of regex retries in iteration 452: 0 [2026-03-25 21:39:11,262][__main__][INFO] - agents played in iteration 452 are Bob, Alice [2026-03-25 21:39:11,902][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:39:11,966][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:39:11,967][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:39:11,968][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:39:12,646][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:39:13,295][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:39:14,014][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:39:14,732][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:39:15,450][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:39:16,168][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:39:16,888][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:39:17,606][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:39:18,323][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:39:19,042][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:39:19,761][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:39:20,481][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:39:21,199][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:39:21,917][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:39:22,637][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:39:23,355][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:39:24,075][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:39:24,795][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:39:25,514][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:39:26,232][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:39:26,952][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:39:27,670][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:39:28,391][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:39:29,110][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:39:29,830][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:39:30,551][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:39:31,272][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:39:31,991][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:39:32,711][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:39:33,432][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:39:34,151][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:39:34,870][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:39:35,591][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:39:36,312][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:39:37,032][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:39:37,752][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:39:38,474][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:39:39,195][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:39:39,915][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:39:40,636][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:39:41,358][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:39:42,078][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:39:42,798][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:39:43,520][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:39:44,240][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:39:44,960][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:39:45,682][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:39:46,404][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:39:47,354][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:39:48,077][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:39:48,797][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:39:49,518][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:39:50,240][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:39:50,960][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:39:51,682][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:39:52,403][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:39:53,123][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:39:53,845][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:39:54,565][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:39:55,287][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:39:56,009][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:39:56,909][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:39:57,630][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:39:58,350][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:39:59,072][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:39:59,816][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 21:40:01,024][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:40:01,028][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:40:01,030][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:40:02,619][__main__][INFO] - Iteration 453 took 56s (9.40% Gen, 87.79% Train). Generation: 5s, Training: 49s. Estimated remaining time: 8h 29m 19s. Estimated total time: 15h 44m 47s. Time estimates for 10 more iterations: 9m 26s, 100 more iterations: 1h 34m 28s, 500 more iterations: 7h 52m 23s. [2026-03-25 21:40:02,622][__main__][INFO] - Starting iteration 453. [2026-03-25 21:40:02,627][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:40:02,628][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:40:09,317][__main__][INFO] - Number of regex retries in iteration 453: 0 [2026-03-25 21:40:09,318][__main__][INFO] - agents played in iteration 453 are Bob, Alice [2026-03-25 21:40:09,838][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:40:09,907][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:40:09,909][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:40:09,909][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:40:10,611][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:40:11,258][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:40:11,976][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:40:12,694][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:40:13,411][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:40:14,129][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:40:14,847][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:40:15,565][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:40:16,283][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:40:17,002][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:40:17,719][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:40:18,436][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:40:19,155][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:40:19,873][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:40:20,591][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:40:21,309][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:40:22,028][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:40:22,750][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:40:23,469][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:40:24,187][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:40:24,907][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:40:25,625][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:40:26,344][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:40:27,063][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:40:27,781][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:40:28,502][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:40:29,220][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:40:29,939][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:40:30,660][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:40:31,379][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:40:32,098][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:40:32,818][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:40:33,537][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:40:34,255][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:40:34,976][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:40:35,695][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:40:36,412][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:40:37,134][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:40:37,852][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:40:38,572][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:40:39,293][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:40:40,012][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:40:40,732][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:40:41,452][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:40:42,172][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:40:42,891][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:40:43,612][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:40:44,330][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:40:45,354][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:40:46,076][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:40:46,796][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:40:47,514][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:40:48,235][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:40:48,955][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:40:49,673][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:40:50,394][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:40:51,114][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:40:51,833][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:40:52,553][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:40:53,274][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:40:53,993][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:40:54,714][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:40:55,434][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:40:56,155][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:40:56,875][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:40:57,630][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 21:40:59,052][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:40:59,056][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:40:59,059][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:41:00,676][__main__][INFO] - Iteration 454 took 58s (11.52% Gen, 85.68% Train). Generation: 6s, Training: 49s. Estimated remaining time: 8h 51m 4s. Estimated total time: 16h 7m 31s. Time estimates for 10 more iterations: 9m 40s, 100 more iterations: 1h 36m 45s, 500 more iterations: 8h 3m 45s. [2026-03-25 21:41:00,679][__main__][INFO] - Starting iteration 454. [2026-03-25 21:41:00,683][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:41:00,683][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:41:06,253][__main__][INFO] - Number of regex retries in iteration 454: 0 [2026-03-25 21:41:06,254][__main__][INFO] - agents played in iteration 454 are Bob, Alice [2026-03-25 21:41:06,753][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:41:06,817][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:41:06,818][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:41:06,819][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:41:07,507][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:41:08,153][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:41:08,873][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:41:09,589][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:41:10,306][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:41:11,024][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:41:11,741][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:41:12,459][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:41:13,177][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:41:13,895][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:41:14,612][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:41:15,331][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:41:16,048][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:41:16,766][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:41:17,484][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:41:18,202][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:41:18,921][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:41:19,638][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:41:20,357][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:41:21,076][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:41:21,793][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:41:22,513][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:41:23,232][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:41:23,950][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:41:24,673][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:41:25,391][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:41:26,111][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:41:26,833][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:41:27,553][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:41:28,272][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:41:28,991][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:41:29,710][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:41:30,429][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:41:31,149][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:41:31,869][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:41:32,588][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:41:33,308][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:41:34,027][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:41:34,747][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:41:35,468][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:41:36,188][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:41:36,909][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:41:37,628][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:41:38,349][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:41:39,067][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:41:39,787][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:41:40,509][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:41:41,228][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:41:42,176][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:41:42,897][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:41:43,617][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:41:44,338][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:41:45,057][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:41:45,778][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:41:46,499][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:41:47,219][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:41:47,938][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:41:48,658][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:41:49,380][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:41:50,100][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:41:50,820][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:41:51,541][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:41:52,264][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:41:52,982][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:41:53,704][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:41:54,438][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 21:41:55,525][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:41:55,528][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:41:55,530][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:41:56,882][__main__][INFO] - Iteration 455 took 56s (9.91% Gen, 87.68% Train). Generation: 5s, Training: 49s. Estimated remaining time: 8h 19m 18s. Estimated total time: 15h 36m 40s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 40s, 500 more iterations: 7h 48m 20s. [2026-03-25 21:41:56,884][__main__][INFO] - Starting iteration 455. [2026-03-25 21:41:56,889][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:41:56,889][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:42:01,889][__main__][INFO] - Number of regex retries in iteration 455: 0 [2026-03-25 21:42:01,890][__main__][INFO] - agents played in iteration 455 are Bob, Alice [2026-03-25 21:42:02,391][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:42:02,455][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:42:02,456][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:42:02,457][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:42:03,146][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:42:03,793][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:42:04,514][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:42:05,230][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:42:05,950][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:42:06,668][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:42:07,387][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:42:08,104][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:42:08,824][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:42:09,544][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:42:10,261][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:42:10,981][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:42:11,699][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:42:12,417][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:42:13,135][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:42:13,854][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:42:14,573][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:42:15,291][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:42:16,012][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:42:16,730][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:42:17,450][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:42:18,168][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:42:18,889][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:42:19,608][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:42:20,327][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:42:21,048][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:42:21,765][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:42:22,486][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:42:23,205][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:42:23,924][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:42:24,646][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:42:25,365][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:42:26,085][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:42:26,807][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:42:27,527][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:42:28,247][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:42:28,967][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:42:29,688][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:42:30,407][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:42:31,126][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:42:31,848][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:42:32,568][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:42:33,288][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:42:34,009][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:42:34,729][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:42:35,449][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:42:36,169][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:42:36,890][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:42:37,841][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:42:38,565][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:42:39,287][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:42:40,009][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:42:40,729][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:42:41,449][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:42:42,171][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:42:42,893][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:42:43,612][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:42:44,333][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:42:45,056][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:42:45,778][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:42:46,498][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:42:47,220][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:42:47,941][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:42:48,664][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:42:49,384][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:42:50,119][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 21:42:51,398][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:42:51,403][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:42:51,405][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:42:53,622][__main__][INFO] - Iteration 456 took 56s (8.81% Gen, 87.27% Train). Generation: 5s, Training: 49s. Estimated remaining time: 8h 27m 16s. Estimated total time: 15h 45m 35s. Time estimates for 10 more iterations: 9m 27s, 100 more iterations: 1h 34m 33s, 500 more iterations: 7h 52m 47s. [2026-03-25 21:42:53,626][__main__][INFO] - Starting iteration 456. [2026-03-25 21:42:53,633][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:42:53,634][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:42:58,522][__main__][INFO] - Number of regex retries in iteration 456: 0 [2026-03-25 21:42:58,523][__main__][INFO] - agents played in iteration 456 are Bob, Alice [2026-03-25 21:42:59,018][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:42:59,084][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:42:59,085][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:42:59,087][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:42:59,773][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:43:00,419][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:43:01,140][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:43:01,859][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:43:02,577][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:43:03,295][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:43:04,013][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:43:04,732][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:43:05,450][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:43:06,169][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:43:06,887][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:43:07,606][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:43:08,324][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:43:09,043][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:43:09,761][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:43:10,480][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:43:11,199][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:43:11,916][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:43:12,638][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:43:13,356][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:43:14,074][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:43:14,795][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:43:15,514][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:43:16,234][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:43:16,954][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:43:17,673][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:43:18,392][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:43:19,113][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:43:19,833][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:43:20,553][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:43:21,272][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:43:21,991][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:43:22,712][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:43:23,431][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:43:24,151][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:43:24,872][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:43:25,593][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:43:26,313][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:43:27,034][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:43:27,754][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:43:28,474][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:43:29,194][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:43:29,915][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:43:30,636][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:43:31,356][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:43:32,078][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:43:32,798][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:43:33,519][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:43:34,545][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:43:35,267][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:43:35,987][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:43:36,710][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:43:37,434][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:43:38,155][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:43:38,877][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:43:39,601][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:43:40,324][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:43:41,046][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:43:41,770][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:43:42,491][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:43:43,213][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:43:43,935][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:43:44,657][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:43:45,377][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:43:46,099][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:43:46,851][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 21:43:47,938][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:43:47,941][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:43:47,943][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:43:49,302][__main__][INFO] - Iteration 457 took 55s (8.78% Gen, 88.77% Train). Generation: 4s, Training: 49s. Estimated remaining time: 8h 8m 37s. Estimated total time: 15h 27m 52s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 47s, 500 more iterations: 7h 43m 56s. [2026-03-25 21:43:49,305][__main__][INFO] - Starting iteration 457. [2026-03-25 21:43:49,309][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:43:49,310][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:43:54,232][__main__][INFO] - Number of regex retries in iteration 457: 0 [2026-03-25 21:43:54,233][__main__][INFO] - agents played in iteration 457 are Bob, Alice [2026-03-25 21:43:54,730][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:43:54,795][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:43:54,796][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:43:54,797][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:43:55,482][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:43:56,131][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:43:56,852][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:43:57,570][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:43:58,288][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:43:59,006][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:43:59,725][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:44:00,445][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:44:01,163][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:44:01,882][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:44:02,601][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:44:03,318][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:44:04,039][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:44:04,758][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:44:05,476][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:44:06,196][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:44:06,915][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:44:07,633][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:44:08,354][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:44:09,075][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:44:09,796][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:44:10,515][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:44:11,236][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:44:11,955][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:44:12,677][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:44:13,396][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:44:14,116][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:44:14,836][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:44:15,556][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:44:16,275][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:44:16,996][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:44:17,718][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:44:18,437][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:44:19,158][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:44:19,880][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:44:20,600][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:44:21,321][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:44:22,042][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:44:22,763][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:44:23,482][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:44:24,203][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:44:24,927][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:44:25,648][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:44:26,370][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:44:27,094][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:44:27,816][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:44:28,539][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:44:29,260][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:44:30,225][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:44:30,950][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:44:31,674][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:44:32,399][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:44:33,122][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:44:33,847][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:44:34,571][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:44:35,294][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:44:36,017][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:44:36,741][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:44:37,464][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:44:38,187][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:44:38,911][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:44:39,635][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:44:40,358][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:44:41,080][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:44:41,804][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:44:42,554][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 21:44:43,677][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:44:43,681][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:44:43,682][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:44:45,025][__main__][INFO] - Iteration 458 took 55s (8.84% Gen, 88.75% Train). Generation: 4s, Training: 49s. Estimated remaining time: 8h 8m 27s. Estimated total time: 15h 28m 38s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 51s, 500 more iterations: 7h 44m 19s. [2026-03-25 21:44:45,028][__main__][INFO] - Starting iteration 458. [2026-03-25 21:44:45,031][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:44:45,032][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:44:50,124][__main__][INFO] - Number of regex retries in iteration 458: 0 [2026-03-25 21:44:50,125][__main__][INFO] - agents played in iteration 458 are Bob, Alice [2026-03-25 21:44:50,643][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:44:50,710][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:44:50,711][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:44:50,711][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:44:51,416][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:44:52,066][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:44:52,787][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:44:53,508][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:44:54,227][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:44:54,947][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:44:55,669][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:44:56,387][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:44:57,109][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:44:57,830][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:44:58,551][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:44:59,270][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:44:59,991][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:45:00,713][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:45:01,433][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:45:02,154][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:45:02,875][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:45:03,596][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:45:04,316][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:45:05,038][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:45:05,760][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:45:06,481][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:45:07,202][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:45:07,923][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:45:08,647][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:45:09,369][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:45:10,092][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:45:10,813][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:45:11,534][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:45:12,256][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:45:12,978][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:45:13,700][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:45:14,420][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:45:15,143][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:45:15,866][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:45:16,588][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:45:17,309][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:45:18,032][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:45:18,754][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:45:19,478][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:45:20,199][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:45:20,922][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:45:21,645][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:45:22,367][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:45:23,090][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:45:23,814][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:45:24,536][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:45:25,258][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:45:26,217][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:45:26,940][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:45:27,662][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:45:28,385][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:45:29,107][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:45:29,831][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:45:30,554][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:45:31,275][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:45:31,999][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:45:32,723][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:45:33,452][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:45:34,178][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:45:34,901][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:45:35,624][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:45:36,347][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:45:37,070][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:45:37,795][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:45:38,536][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 21:45:39,654][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:45:39,658][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:45:39,659][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:45:41,051][__main__][INFO] - Iteration 459 took 56s (9.09% Gen, 88.42% Train). Generation: 5s, Training: 49s. Estimated remaining time: 8h 12m 34s. Estimated total time: 15h 33m 41s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 22s, 500 more iterations: 7h 46m 50s. [2026-03-25 21:45:41,053][__main__][INFO] - Starting iteration 459. [2026-03-25 21:45:41,058][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:45:41,058][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:45:46,257][__main__][INFO] - Number of regex retries in iteration 459: 0 [2026-03-25 21:45:46,258][__main__][INFO] - agents played in iteration 459 are Bob, Alice [2026-03-25 21:45:47,195][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:45:47,260][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:45:47,262][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:45:47,262][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:45:47,983][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:45:48,634][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:45:49,355][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:45:50,074][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:45:50,794][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:45:51,515][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:45:52,234][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:45:52,958][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:45:53,675][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:45:54,396][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:45:55,116][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:45:55,835][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:45:56,557][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:45:57,275][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:45:57,996][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:45:58,718][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:45:59,437][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:46:00,159][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:46:00,880][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:46:01,601][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:46:02,322][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:46:03,042][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:46:03,765][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:46:04,486][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:46:05,207][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:46:05,929][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:46:06,650][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:46:07,372][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:46:08,094][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:46:08,815][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:46:09,537][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:46:10,260][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:46:10,983][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:46:11,707][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:46:12,429][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:46:13,152][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:46:13,875][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:46:14,598][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:46:15,321][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:46:16,043][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:46:16,766][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:46:17,489][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:46:18,213][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:46:18,935][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:46:19,660][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:46:20,384][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:46:21,107][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:46:21,831][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:46:22,875][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:46:23,600][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:46:24,323][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:46:25,048][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:46:25,772][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:46:26,492][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:46:27,213][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:46:27,937][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:46:28,659][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:46:29,380][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:46:30,102][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:46:30,824][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:46:31,549][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:46:32,273][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:46:32,996][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:46:33,718][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:46:34,445][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:46:35,208][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 21:46:36,277][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:46:36,281][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:46:36,283][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:46:37,711][__main__][INFO] - Iteration 460 took 56s (9.18% Gen, 88.30% Train). Generation: 5s, Training: 50s. Estimated remaining time: 8h 22m 12s. Estimated total time: 15h 44m 15s. Time estimates for 10 more iterations: 9m 26s, 100 more iterations: 1h 34m 25s, 500 more iterations: 7h 52m 7s. [2026-03-25 21:46:37,714][__main__][INFO] - Starting iteration 460. [2026-03-25 21:46:37,718][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:46:37,719][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:46:42,603][__main__][INFO] - Number of regex retries in iteration 460: 0 [2026-03-25 21:46:42,604][__main__][INFO] - agents played in iteration 460 are Bob, Alice [2026-03-25 21:46:43,117][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:46:43,181][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:46:43,182][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:46:43,183][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:46:43,868][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:46:44,519][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:46:45,241][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:46:45,961][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:46:46,679][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:46:47,400][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:46:48,119][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:46:48,839][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:46:49,556][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:46:50,275][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:46:50,995][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:46:51,714][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:46:52,434][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:46:53,152][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:46:53,871][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:46:54,591][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:46:55,310][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:46:56,029][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:46:56,750][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:46:57,470][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:46:58,190][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:46:58,911][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:46:59,630][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:47:00,349][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:47:01,071][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:47:01,791][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:47:02,510][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:47:03,231][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:47:03,952][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:47:04,671][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:47:05,392][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:47:06,113][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:47:06,834][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:47:07,554][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:47:08,274][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:47:08,996][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:47:09,718][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:47:10,437][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:47:11,157][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:47:11,882][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:47:12,604][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:47:13,326][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:47:14,047][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:47:14,769][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:47:15,490][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:47:16,210][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:47:16,932][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:47:17,653][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:47:18,596][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:47:19,317][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:47:20,038][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:47:20,760][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:47:21,482][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:47:22,204][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:47:22,926][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:47:23,649][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:47:24,370][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:47:25,093][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:47:25,813][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:47:26,535][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:47:27,257][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:47:27,979][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:47:28,700][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:47:29,422][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:47:30,144][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:47:30,876][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 21:47:32,148][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:47:32,151][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:47:32,154][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:47:33,645][__main__][INFO] - Iteration 461 took 55s (8.74% Gen, 88.59% Train). Generation: 4s, Training: 49s. Estimated remaining time: 8h 9m 9s. Estimated total time: 15h 32m 8s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 12s, 500 more iterations: 7h 46m 4s. [2026-03-25 21:47:33,647][__main__][INFO] - Starting iteration 461. [2026-03-25 21:47:33,651][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:47:33,652][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:47:38,570][__main__][INFO] - Number of regex retries in iteration 461: 0 [2026-03-25 21:47:38,571][__main__][INFO] - agents played in iteration 461 are Bob, Alice [2026-03-25 21:47:39,076][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:47:39,142][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:47:39,142][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:47:39,143][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:47:39,835][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:47:40,484][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:47:41,203][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:47:41,922][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:47:42,640][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:47:43,359][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:47:44,076][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:47:44,795][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:47:45,514][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:47:46,232][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:47:46,952][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:47:47,670][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:47:48,389][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:47:49,111][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:47:49,830][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:47:50,551][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:47:51,272][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:47:51,991][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:47:52,712][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:47:53,432][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:47:54,151][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:47:54,871][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:47:55,589][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:47:56,311][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:47:57,029][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:47:57,750][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:47:58,472][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:47:59,191][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:47:59,910][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:48:00,630][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:48:01,351][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:48:02,071][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:48:02,791][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:48:03,513][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:48:04,232][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:48:04,952][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:48:05,673][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:48:06,392][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:48:07,113][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:48:07,835][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:48:08,553][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:48:09,276][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:48:09,998][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:48:10,718][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:48:11,439][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:48:12,158][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:48:12,881][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:48:13,603][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:48:14,562][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:48:15,284][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:48:16,003][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:48:16,725][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:48:17,446][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:48:18,165][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:48:18,888][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:48:19,609][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:48:20,330][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:48:21,051][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:48:21,773][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:48:22,493][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:48:23,214][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:48:23,936][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:48:24,658][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:48:25,380][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:48:26,100][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:48:26,833][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 21:48:27,947][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:48:27,951][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:48:27,952][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:48:29,315][__main__][INFO] - Iteration 462 took 55s (8.84% Gen, 88.71% Train). Generation: 4s, Training: 49s. Estimated remaining time: 8h 3m 50s. Estimated total time: 15h 27m 45s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 46s, 500 more iterations: 7h 43m 52s. [2026-03-25 21:48:29,318][__main__][INFO] - Starting iteration 462. [2026-03-25 21:48:29,322][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:48:29,322][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:48:34,354][__main__][INFO] - Number of regex retries in iteration 462: 0 [2026-03-25 21:48:34,355][__main__][INFO] - agents played in iteration 462 are Bob, Alice [2026-03-25 21:48:34,856][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:48:34,921][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:48:34,922][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:48:34,922][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:48:35,609][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:48:36,257][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:48:36,979][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:48:37,698][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:48:38,416][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:48:39,135][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:48:39,856][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:48:40,574][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:48:41,293][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:48:42,012][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:48:42,731][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:48:43,448][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:48:44,168][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:48:44,888][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:48:45,607][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:48:46,327][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:48:47,046][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:48:47,765][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:48:48,485][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:48:49,203][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:48:49,924][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:48:50,644][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:48:51,364][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:48:52,084][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:48:52,804][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:48:53,522][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:48:54,244][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:48:54,964][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:48:55,683][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:48:56,404][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:48:57,123][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:48:57,843][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:48:58,565][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:48:59,283][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:49:00,004][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:49:00,724][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:49:01,444][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:49:02,165][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:49:02,885][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:49:03,605][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:49:04,326][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:49:05,045][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:49:05,769][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:49:06,490][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:49:07,209][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:49:07,929][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:49:08,652][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:49:09,372][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:49:10,397][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:49:11,119][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:49:11,839][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:49:12,561][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:49:13,281][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:49:14,002][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:49:14,725][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:49:15,445][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:49:16,167][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:49:16,888][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:49:17,610][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:49:18,333][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:49:19,055][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:49:19,774][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:49:20,496][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:49:21,218][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:49:21,942][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:49:22,687][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 21:49:23,809][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:49:23,814][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:49:23,815][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:49:25,172][__main__][INFO] - Iteration 463 took 55s (9.01% Gen, 88.55% Train). Generation: 5s, Training: 49s. Estimated remaining time: 8h 6m 1s. Estimated total time: 15h 30m 52s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 5s, 500 more iterations: 7h 45m 26s. [2026-03-25 21:49:25,175][__main__][INFO] - Starting iteration 463. [2026-03-25 21:49:25,179][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:49:25,180][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:49:32,557][__main__][INFO] - Number of regex retries in iteration 463: 0 [2026-03-25 21:49:32,558][__main__][INFO] - agents played in iteration 463 are Bob, Alice [2026-03-25 21:49:33,059][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:49:33,123][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:49:33,124][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:49:33,125][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:49:33,807][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:49:34,455][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:49:35,174][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:49:35,891][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:49:36,608][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:49:37,327][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:49:38,043][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:49:38,763][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:49:39,482][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:49:40,199][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:49:40,918][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:49:41,636][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:49:42,355][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:49:43,074][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:49:43,793][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:49:44,512][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:49:45,230][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:49:45,948][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:49:46,667][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:49:47,386][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:49:48,104][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:49:48,822][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:49:49,541][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:49:50,261][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:49:50,979][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:49:51,697][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:49:52,417][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:49:53,135][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:49:53,854][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:49:54,573][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:49:55,293][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:49:56,013][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:49:56,731][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:49:57,452][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:49:58,172][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:49:58,890][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:49:59,611][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:50:00,330][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:50:01,049][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:50:01,770][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:50:02,489][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:50:03,208][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:50:03,929][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:50:04,648][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:50:05,369][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:50:06,089][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:50:06,808][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:50:07,530][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:50:08,488][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:50:09,210][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:50:09,930][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:50:10,650][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:50:11,371][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:50:12,090][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:50:12,813][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:50:13,534][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:50:14,253][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:50:14,975][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:50:15,697][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:50:16,416][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:50:17,137][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:50:17,858][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:50:18,580][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:50:19,300][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:50:20,021][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:50:20,752][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 21:50:22,048][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:50:22,052][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:50:22,054][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:50:23,603][__main__][INFO] - Iteration 464 took 58s (12.63% Gen, 84.71% Train). Generation: 7s, Training: 49s. Estimated remaining time: 8h 47m 57s. Estimated total time: 16h 13m 46s. Time estimates for 10 more iterations: 9m 44s, 100 more iterations: 1h 37m 22s, 500 more iterations: 8h 6m 53s. [2026-03-25 21:50:23,606][__main__][INFO] - Starting iteration 464. [2026-03-25 21:50:23,610][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:50:23,611][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:50:29,350][__main__][INFO] - Number of regex retries in iteration 464: 0 [2026-03-25 21:50:29,351][__main__][INFO] - agents played in iteration 464 are Bob, Alice [2026-03-25 21:50:29,864][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:50:29,931][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:50:29,932][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:50:29,933][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:50:30,629][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:50:31,277][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:50:31,998][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:50:32,716][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:50:33,435][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:50:34,156][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:50:34,875][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:50:35,595][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:50:36,316][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:50:37,034][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:50:37,754][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:50:38,474][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:50:39,194][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:50:39,915][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:50:40,635][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:50:41,354][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:50:42,073][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:50:42,795][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:50:43,515][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:50:44,236][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:50:44,956][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:50:45,677][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:50:46,397][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:50:47,117][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:50:47,839][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:50:48,559][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:50:49,280][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:50:50,001][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:50:50,722][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:50:51,441][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:50:52,162][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:50:52,885][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:50:53,607][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:50:54,327][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:50:55,048][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:50:55,770][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:50:56,491][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:50:57,211][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:50:57,933][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:50:58,655][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:50:59,376][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:51:00,096][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:51:00,819][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:51:01,541][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:51:02,260][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:51:02,981][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:51:03,702][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:51:04,421][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:51:05,378][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:51:06,098][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:51:06,817][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:51:07,539][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:51:08,258][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:51:08,980][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:51:09,701][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:51:10,421][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:51:11,142][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:51:11,863][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:51:12,583][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:51:13,304][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:51:14,025][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:51:14,745][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:51:15,467][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:51:16,186][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:51:16,907][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:51:17,640][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 21:51:18,849][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:51:18,853][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:51:18,856][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:51:20,230][__main__][INFO] - Iteration 465 took 56s (10.14% Gen, 87.43% Train). Generation: 5s, Training: 49s. Estimated remaining time: 8h 16m 56s. Estimated total time: 15h 43m 42s. Time estimates for 10 more iterations: 9m 26s, 100 more iterations: 1h 34m 22s, 500 more iterations: 7h 51m 51s. [2026-03-25 21:51:20,233][__main__][INFO] - Starting iteration 465. [2026-03-25 21:51:20,236][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:51:20,237][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:51:25,200][__main__][INFO] - Number of regex retries in iteration 465: 0 [2026-03-25 21:51:25,201][__main__][INFO] - agents played in iteration 465 are Bob, Alice [2026-03-25 21:51:25,709][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:51:25,774][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:51:25,774][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:51:25,775][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:51:26,466][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:51:27,116][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:51:27,836][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:51:28,554][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:51:29,272][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:51:29,990][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:51:30,708][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:51:31,431][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:51:32,149][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:51:32,868][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:51:33,587][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:51:34,305][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:51:35,024][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:51:35,744][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:51:36,463][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:51:37,183][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:51:37,902][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:51:38,622][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:51:39,342][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:51:40,061][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:51:40,780][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:51:41,499][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:51:42,219][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:51:42,938][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:51:43,658][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:51:44,379][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:51:45,097][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:51:45,817][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:51:46,538][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:51:47,257][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:51:47,983][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:51:48,703][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:51:49,424][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:51:50,143][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:51:50,862][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:51:51,585][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:51:52,305][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:51:53,024][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:51:53,746][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:51:54,467][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:51:55,187][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:51:55,906][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:51:56,630][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:51:57,352][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:51:58,072][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:51:58,793][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:51:59,515][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:52:00,235][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:52:01,263][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:52:01,984][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:52:02,703][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:52:03,425][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:52:04,146][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:52:04,865][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:52:05,586][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:52:06,307][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:52:07,026][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:52:07,749][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:52:08,471][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:52:09,192][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:52:09,914][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:52:10,636][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:52:11,355][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:52:12,077][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:52:12,799][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:52:13,549][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 21:52:14,640][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:52:14,643][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:52:14,645][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:52:17,446][__main__][INFO] - Iteration 466 took 57s (8.68% Gen, 86.42% Train). Generation: 4s, Training: 49s. Estimated remaining time: 8h 25m 48s. Estimated total time: 15h 53m 31s. Time estimates for 10 more iterations: 9m 32s, 100 more iterations: 1h 35m 21s, 500 more iterations: 7h 56m 45s. [2026-03-25 21:52:17,450][__main__][INFO] - Starting iteration 466. [2026-03-25 21:52:17,455][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:52:17,456][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:52:22,413][__main__][INFO] - Number of regex retries in iteration 466: 0 [2026-03-25 21:52:22,414][__main__][INFO] - agents played in iteration 466 are Bob, Alice [2026-03-25 21:52:22,959][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:52:23,026][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:52:23,028][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:52:23,029][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:52:23,727][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:52:24,376][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:52:25,094][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:52:25,813][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:52:26,530][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:52:27,250][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:52:27,968][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:52:28,687][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:52:29,406][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:52:30,124][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:52:30,845][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:52:31,563][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:52:32,281][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:52:33,000][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:52:33,719][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:52:34,437][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:52:35,156][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:52:35,875][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:52:36,594][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:52:37,313][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:52:38,032][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:52:38,753][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:52:39,472][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:52:40,191][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:52:40,910][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:52:41,629][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:52:42,349][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:52:43,069][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:52:43,786][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:52:44,506][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:52:45,226][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:52:45,945][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:52:46,664][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:52:47,385][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:52:48,104][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:52:48,823][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:52:49,544][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:52:50,264][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:52:50,983][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:52:51,704][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:52:52,423][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:52:53,143][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:52:53,864][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:52:54,584][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:52:55,303][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:52:56,024][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:52:56,744][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:52:57,463][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:52:58,422][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:52:59,145][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:52:59,864][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:53:00,585][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:53:01,306][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:53:02,027][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:53:02,747][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:53:03,466][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:53:04,188][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:53:04,908][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:53:05,628][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:53:06,350][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:53:07,071][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:53:07,792][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:53:08,512][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:53:09,235][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:53:09,956][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:53:10,691][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 21:53:11,775][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:53:11,778][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:53:11,779][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:53:13,139][__main__][INFO] - Iteration 467 took 55s (8.90% Gen, 88.65% Train). Generation: 4s, Training: 49s. Estimated remaining time: 7h 59m 27s. Estimated total time: 15h 28m 6s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 48s, 500 more iterations: 7h 44m 3s. [2026-03-25 21:53:13,141][__main__][INFO] - Starting iteration 467. [2026-03-25 21:53:13,145][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:53:13,146][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:53:18,110][__main__][INFO] - Number of regex retries in iteration 467: 0 [2026-03-25 21:53:18,111][__main__][INFO] - agents played in iteration 467 are Bob, Alice [2026-03-25 21:53:18,706][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:53:18,771][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:53:18,772][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:53:18,772][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:53:19,464][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:53:20,112][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:53:20,831][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:53:21,550][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:53:22,268][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:53:22,986][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:53:23,704][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:53:24,422][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:53:25,142][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:53:25,858][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:53:26,578][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:53:27,297][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:53:28,015][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:53:28,733][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:53:29,452][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:53:30,170][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:53:30,888][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:53:31,607][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:53:32,325][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:53:33,045][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:53:33,763][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:53:34,484][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:53:35,203][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:53:35,921][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:53:36,643][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:53:37,360][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:53:38,080][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:53:38,800][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:53:39,519][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:53:40,238][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:53:40,959][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:53:41,677][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:53:42,397][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:53:43,117][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:53:43,835][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:53:44,554][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:53:45,275][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:53:45,995][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:53:46,714][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:53:47,436][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:53:48,155][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:53:48,874][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:53:49,595][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:53:50,316][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:53:51,035][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:53:51,756][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:53:52,477][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:53:53,197][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:53:54,151][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:53:54,872][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:53:55,591][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:53:56,312][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:53:57,033][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:53:57,753][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:53:58,474][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:53:59,196][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:53:59,915][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:54:00,636][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:54:01,357][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:54:02,080][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:54:02,799][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:54:03,520][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:54:04,242][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:54:04,963][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:54:05,685][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:54:06,410][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 21:54:07,536][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:54:07,540][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:54:07,542][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:54:08,935][__main__][INFO] - Iteration 468 took 55s (8.90% Gen, 88.60% Train). Generation: 4s, Training: 49s. Estimated remaining time: 8h 0m 16s. Estimated total time: 15h 29m 51s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 59s, 500 more iterations: 7h 44m 55s. [2026-03-25 21:54:08,937][__main__][INFO] - Starting iteration 468. [2026-03-25 21:54:08,941][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:54:08,942][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:54:14,004][__main__][INFO] - Number of regex retries in iteration 468: 0 [2026-03-25 21:54:14,005][__main__][INFO] - agents played in iteration 468 are Bob, Alice [2026-03-25 21:54:14,523][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:54:14,589][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:54:14,590][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:54:14,590][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:54:15,290][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:54:15,938][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:54:16,658][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:54:17,379][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:54:18,098][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:54:18,816][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:54:19,534][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:54:20,251][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:54:20,971][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:54:21,689][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:54:22,408][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:54:23,128][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:54:23,846][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:54:24,565][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:54:25,284][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:54:26,002][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:54:26,723][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:54:27,444][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:54:28,163][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:54:28,884][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:54:29,602][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:54:30,325][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:54:31,044][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:54:31,763][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:54:32,483][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:54:33,202][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:54:33,921][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:54:34,643][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:54:35,361][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:54:36,081][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:54:36,802][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:54:37,520][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:54:38,242][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:54:38,962][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:54:39,681][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:54:40,403][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:54:41,122][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:54:41,842][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:54:42,563][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:54:43,283][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:54:44,002][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:54:44,723][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:54:45,442][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:54:46,164][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:54:46,885][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:54:47,605][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:54:48,326][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:54:49,047][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:54:50,074][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:54:50,796][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:54:51,515][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:54:52,236][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:54:52,960][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:54:53,681][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:54:54,402][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:54:55,123][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:54:55,844][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:54:56,566][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:54:57,286][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:54:58,007][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:54:58,729][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:54:59,451][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:55:00,172][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:55:00,892][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:55:01,615][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:55:02,381][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 21:55:03,553][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:55:03,557][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:55:03,559][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:55:04,893][__main__][INFO] - Iteration 469 took 55s (9.05% Gen, 88.56% Train). Generation: 5s, Training: 49s. Estimated remaining time: 8h 2m 3s. Estimated total time: 15h 32m 33s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 15s, 500 more iterations: 7h 46m 16s. [2026-03-25 21:55:04,896][__main__][INFO] - Starting iteration 469. [2026-03-25 21:55:04,900][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:55:04,901][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:55:10,589][__main__][INFO] - Number of regex retries in iteration 469: 0 [2026-03-25 21:55:10,591][__main__][INFO] - agents played in iteration 469 are Bob, Alice [2026-03-25 21:55:11,091][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:55:11,155][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:55:11,156][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:55:11,157][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:55:11,869][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:55:12,517][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:55:13,238][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:55:13,956][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:55:14,675][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:55:15,394][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:55:16,111][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:55:16,830][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:55:17,549][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:55:18,266][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:55:18,984][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:55:19,703][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:55:20,421][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:55:21,141][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:55:21,859][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:55:22,577][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:55:23,297][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:55:24,015][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:55:24,735][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:55:25,453][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:55:26,172][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:55:26,892][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:55:27,610][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:55:28,330][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:55:29,050][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:55:29,768][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:55:30,489][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:55:31,209][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:55:31,927][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:55:32,648][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:55:33,367][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:55:34,088][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:55:34,810][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:55:35,529][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:55:36,249][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:55:36,970][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:55:37,689][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:55:38,409][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:55:39,130][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:55:39,850][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:55:40,571][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:55:41,292][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:55:42,013][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:55:42,735][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:55:43,455][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:55:44,174][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:55:44,897][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:55:45,617][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:55:46,583][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:55:47,306][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:55:48,026][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:55:48,746][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:55:49,468][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:55:50,189][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:55:50,909][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:55:51,630][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:55:52,352][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:55:53,073][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:55:53,793][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:55:54,513][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:55:55,235][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:55:55,955][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:55:56,676][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:55:57,398][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:55:58,119][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:55:58,854][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 21:56:00,107][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:56:00,111][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:56:00,113][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:56:01,587][__main__][INFO] - Iteration 470 took 56s (10.04% Gen, 87.36% Train). Generation: 5s, Training: 49s. Estimated remaining time: 8h 13m 20s. Estimated total time: 15h 44m 48s. Time estimates for 10 more iterations: 9m 26s, 100 more iterations: 1h 34m 28s, 500 more iterations: 7h 52m 24s. [2026-03-25 21:56:01,590][__main__][INFO] - Starting iteration 470. [2026-03-25 21:56:01,596][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:56:01,597][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:56:09,053][__main__][INFO] - Number of regex retries in iteration 470: 0 [2026-03-25 21:56:09,054][__main__][INFO] - agents played in iteration 470 are Bob, Alice [2026-03-25 21:56:09,556][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:56:09,623][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:56:09,624][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:56:09,625][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:56:10,325][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:56:10,973][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:56:11,691][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:56:12,408][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:56:13,127][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:56:13,844][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:56:14,560][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:56:15,279][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:56:15,997][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:56:16,714][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:56:17,432][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:56:18,150][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:56:18,868][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:56:19,586][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:56:20,304][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:56:21,024][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:56:21,741][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:56:22,460][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:56:23,178][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:56:23,895][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:56:24,614][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:56:25,332][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:56:26,052][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:56:26,771][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:56:27,489][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:56:28,209][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:56:28,927][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:56:29,646][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:56:30,365][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:56:31,083][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:56:31,803][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:56:32,522][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:56:33,240][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:56:33,960][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:56:34,679][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:56:35,399][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:56:36,119][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:56:36,837][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:56:37,557][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:56:38,278][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:56:38,997][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:56:39,718][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:56:40,438][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:56:41,156][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:56:41,877][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:56:42,598][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:56:43,316][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:56:44,037][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:56:44,990][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:56:45,712][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:56:46,431][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:56:47,150][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:56:47,873][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:56:48,591][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:56:49,312][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:56:50,032][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:56:50,751][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:56:51,474][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:56:52,194][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:56:52,914][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:56:53,635][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:56:54,356][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:56:55,076][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:56:55,796][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:56:56,516][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:56:57,266][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 21:56:58,413][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:56:58,457][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:56:58,459][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:56:59,994][__main__][INFO] - Iteration 471 took 58s (12.77% Gen, 84.60% Train). Generation: 7s, Training: 49s. Estimated remaining time: 8h 40m 54s. Estimated total time: 16h 13m 20s. Time estimates for 10 more iterations: 9m 44s, 100 more iterations: 1h 37m 20s, 500 more iterations: 8h 6m 40s. [2026-03-25 21:56:59,997][__main__][INFO] - Starting iteration 471. [2026-03-25 21:57:00,000][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:57:00,001][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:57:11,162][__main__][INFO] - Number of regex retries in iteration 471: 0 [2026-03-25 21:57:11,163][__main__][INFO] - agents played in iteration 471 are Bob, Alice [2026-03-25 21:57:11,657][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:57:11,722][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:57:11,723][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:57:11,724][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:57:12,415][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:57:13,061][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:57:13,781][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:57:14,496][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:57:15,214][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:57:15,928][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:57:16,646][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:57:17,360][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:57:18,079][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:57:18,793][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:57:19,511][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:57:20,226][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:57:20,943][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:57:21,658][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:57:22,375][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:57:23,092][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:57:23,810][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:57:24,526][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:57:25,244][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:57:25,961][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:57:26,678][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:57:27,396][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:57:28,114][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:57:28,831][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:57:29,549][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:57:30,267][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:57:30,983][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:57:31,702][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:57:32,419][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:57:33,137][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:57:33,855][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:57:34,574][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:57:35,291][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:57:36,009][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:57:36,727][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:57:37,447][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:57:38,163][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:57:38,884][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:57:39,603][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:57:40,322][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:57:41,041][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:57:41,759][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:57:42,478][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:57:43,196][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:57:43,914][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:57:44,634][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:57:45,352][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:57:46,072][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:57:47,105][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:57:47,825][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:57:48,544][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:57:49,264][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:57:49,983][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:57:50,702][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:57:51,422][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:57:52,141][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:57:52,861][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:57:53,580][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:57:54,301][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:57:55,023][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:57:55,741][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:57:56,461][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:57:57,181][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:57:57,900][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:57:58,621][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:57:59,378][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 21:58:00,425][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:58:00,430][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:58:00,431][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:58:01,765][__main__][INFO] - Iteration 472 took 1m 1s (18.07% Gen, 79.77% Train). Generation: 11s, Training: 49s. Estimated remaining time: 9h 35m 58s. Estimated total time: 17h 9m 26s. Time estimates for 10 more iterations: 10m 17s, 100 more iterations: 1h 42m 56s, 500 more iterations: 8h 34m 43s. [2026-03-25 21:58:01,767][__main__][INFO] - Starting iteration 472. [2026-03-25 21:58:01,771][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:58:01,772][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:58:07,181][__main__][INFO] - Number of regex retries in iteration 472: 0 [2026-03-25 21:58:07,182][__main__][INFO] - agents played in iteration 472 are Bob, Alice [2026-03-25 21:58:07,685][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:58:07,751][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:58:07,752][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:58:07,752][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:58:08,451][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:58:09,101][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:58:09,821][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:58:10,537][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:58:11,256][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:58:11,974][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:58:12,693][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:58:13,410][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:58:14,129][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:58:14,846][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:58:15,563][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:58:16,282][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:58:17,000][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:58:17,717][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:58:18,437][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:58:19,154][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:58:19,873][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:58:20,591][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:58:21,309][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:58:22,028][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:58:22,746][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:58:23,465][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:58:24,184][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:58:24,902][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:58:25,623][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:58:26,342][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:58:27,060][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:58:27,781][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:58:28,498][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:58:29,219][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:58:29,939][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:58:30,657][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:58:31,378][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:58:32,099][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:58:32,816][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:58:33,536][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:58:34,258][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:58:34,977][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:58:35,697][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:58:36,418][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:58:37,138][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:58:37,858][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:58:38,579][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:58:39,301][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:58:40,022][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:58:40,741][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:58:41,462][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:58:42,184][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:58:43,139][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:58:43,862][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:58:44,582][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:58:45,302][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:58:46,023][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:58:46,746][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:58:47,466][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:58:48,187][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:58:48,908][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:58:49,629][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:58:50,349][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:58:51,070][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:58:51,792][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:58:52,512][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:58:53,233][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:58:53,954][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:58:54,675][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:58:55,404][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 21:58:56,439][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:58:56,442][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:58:56,444][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:58:57,805][__main__][INFO] - Iteration 473 took 56s (9.65% Gen, 87.91% Train). Generation: 5s, Training: 49s. Estimated remaining time: 7h 59m 31s. Estimated total time: 15h 33m 55s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 23s, 500 more iterations: 7h 46m 57s. [2026-03-25 21:58:57,807][__main__][INFO] - Starting iteration 473. [2026-03-25 21:58:57,811][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:58:57,813][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 21:59:02,737][__main__][INFO] - Number of regex retries in iteration 473: 0 [2026-03-25 21:59:02,738][__main__][INFO] - agents played in iteration 473 are Bob, Alice [2026-03-25 21:59:03,311][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:59:03,375][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 21:59:03,376][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 21:59:03,377][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 21:59:04,067][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 21:59:04,717][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 21:59:05,438][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 21:59:06,156][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 21:59:06,874][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 21:59:07,592][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 21:59:08,312][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 21:59:09,031][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 21:59:09,751][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 21:59:10,470][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 21:59:11,188][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 21:59:11,908][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 21:59:12,626][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 21:59:13,347][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 21:59:14,067][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 21:59:14,785][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 21:59:15,505][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 21:59:16,225][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 21:59:16,943][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 21:59:17,664][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 21:59:18,383][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 21:59:19,101][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 21:59:19,823][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 21:59:20,542][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 21:59:21,262][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 21:59:21,982][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 21:59:22,702][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 21:59:23,422][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 21:59:24,143][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 21:59:24,863][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 21:59:25,583][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 21:59:26,304][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 21:59:27,026][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 21:59:27,747][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 21:59:28,467][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 21:59:29,188][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 21:59:29,909][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 21:59:30,630][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 21:59:31,349][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 21:59:32,071][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 21:59:32,792][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 21:59:33,513][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 21:59:34,232][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 21:59:34,953][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 21:59:35,675][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 21:59:36,395][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 21:59:37,116][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 21:59:37,837][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 21:59:38,795][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 21:59:39,516][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 21:59:40,236][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 21:59:40,959][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
21:59:41,679][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 21:59:42,401][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 21:59:43,122][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 21:59:43,845][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 21:59:44,565][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 21:59:45,286][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 21:59:46,008][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 21:59:46,731][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 21:59:47,452][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 21:59:48,174][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 21:59:48,894][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 21:59:49,617][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 21:59:50,340][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 21:59:51,069][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 21:59:52,558][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 21:59:52,562][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 21:59:52,568][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 21:59:57,224][__main__][INFO] - Iteration 474 took 59s (8.29% Gen, 83.87% Train). Generation: 4s, Training: 49s. Estimated remaining time: 8h 54m 52s. Estimated total time: 16h 30m 14s. Time estimates for 10 more iterations: 9m 54s, 100 more iterations: 1h 39m 1s, 500 more iterations: 8h 15m 7s. [2026-03-25 21:59:57,228][__main__][INFO] - Starting iteration 474. [2026-03-25 21:59:57,235][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 21:59:57,236][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:00:02,257][__main__][INFO] - Number of regex retries in iteration 474: 0 [2026-03-25 22:00:02,258][__main__][INFO] - agents played in iteration 474 are Bob, Alice [2026-03-25 22:00:02,820][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:00:02,886][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:00:02,887][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:00:02,887][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:00:03,579][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:00:04,226][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:00:04,946][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:00:05,662][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:00:06,379][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:00:07,096][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:00:07,814][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:00:08,531][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:00:09,249][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:00:09,968][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:00:10,686][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:00:11,406][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:00:12,122][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:00:12,842][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:00:13,560][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:00:14,279][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:00:14,998][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:00:15,718][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:00:16,437][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:00:17,156][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:00:17,874][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:00:18,592][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:00:19,312][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:00:20,029][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:00:20,750][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:00:21,471][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:00:22,190][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:00:22,910][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:00:23,630][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:00:24,349][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:00:25,070][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:00:25,789][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:00:26,508][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:00:27,229][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:00:27,948][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:00:28,668][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:00:29,388][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:00:30,109][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:00:30,829][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:00:31,549][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:00:32,271][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:00:32,991][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:00:33,710][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:00:34,431][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:00:35,152][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:00:35,872][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:00:36,592][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:00:37,312][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:00:38,333][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:00:39,056][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:00:39,776][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:00:40,498][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:00:41,219][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:00:41,938][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:00:42,660][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:00:43,382][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:00:44,101][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:00:44,822][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:00:45,544][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:00:46,266][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:00:46,986][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:00:47,706][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:00:48,429][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:00:49,150][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:00:49,871][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:00:50,641][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 22:00:51,827][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:00:51,832][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:00:51,834][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:00:53,216][__main__][INFO] - Iteration 475 took 55s (8.97% Gen, 88.55% Train). Generation: 5s, Training: 49s. Estimated remaining time: 7h 56m 45s. Estimated total time: 15h 33m 4s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 18s, 500 more iterations: 7h 46m 32s. [2026-03-25 22:00:53,219][__main__][INFO] - Starting iteration 475. [2026-03-25 22:00:53,225][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 22:00:53,225][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:00:59,108][__main__][INFO] - Number of regex retries in iteration 475: 0 [2026-03-25 22:00:59,109][__main__][INFO] - agents played in iteration 475 are Bob, Alice [2026-03-25 22:00:59,853][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:00:59,918][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:00:59,919][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:00:59,919][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:01:00,634][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:01:01,283][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:01:02,003][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:01:02,721][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:01:03,439][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:01:04,157][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:01:04,874][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:01:05,594][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:01:06,310][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:01:07,031][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:01:07,749][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:01:08,468][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:01:09,188][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:01:09,905][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:01:10,625][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:01:11,342][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:01:12,061][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:01:12,780][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:01:13,498][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:01:14,218][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:01:14,937][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:01:15,658][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:01:16,378][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:01:17,096][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:01:17,816][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:01:18,535][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:01:19,257][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:01:19,975][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:01:20,695][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:01:21,415][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:01:22,134][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:01:22,854][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:01:23,574][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:01:24,294][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:01:25,013][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:01:25,734][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:01:26,454][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:01:27,174][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:01:27,895][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:01:28,615][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:01:29,334][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:01:30,055][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:01:30,775][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:01:31,496][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:01:32,216][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:01:32,938][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:01:33,657][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:01:34,378][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:01:35,344][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:01:36,068][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:01:36,788][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:01:37,508][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:01:38,229][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:01:38,951][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:01:39,671][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:01:40,391][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:01:41,112][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:01:41,834][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:01:42,554][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:01:43,275][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:01:43,996][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:01:44,718][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:01:45,440][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:01:46,162][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:01:46,884][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:01:47,611][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 22:01:48,646][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:01:48,649][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:01:48,650][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:01:49,985][__main__][INFO] - Iteration 476 took 56s (10.37% Gen, 87.28% Train). Generation: 5s, Training: 49s. Estimated remaining time: 8h 8m 46s. Estimated total time: 15h 46m 1s. Time estimates for 10 more iterations: 9m 27s, 100 more iterations: 1h 34m 36s, 500 more iterations: 7h 53m 0s. [2026-03-25 22:01:49,989][__main__][INFO] - Starting iteration 476. [2026-03-25 22:01:49,993][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 22:01:49,994][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:01:59,095][__main__][INFO] - Number of regex retries in iteration 476: 0 [2026-03-25 22:01:59,096][__main__][INFO] - agents played in iteration 476 are Bob, Alice [2026-03-25 22:01:59,601][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:01:59,667][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:01:59,668][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:01:59,668][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:02:00,358][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:02:01,004][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:02:01,729][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:02:02,447][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:02:03,164][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:02:03,881][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:02:04,602][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:02:05,319][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:02:06,036][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:02:06,754][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:02:07,472][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:02:08,190][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:02:08,908][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:02:09,626][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:02:10,344][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:02:11,062][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:02:11,780][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:02:12,497][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:02:13,216][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:02:13,934][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:02:14,653][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:02:15,370][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:02:16,089][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:02:16,807][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:02:17,527][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:02:18,246][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:02:18,964][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:02:19,684][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:02:20,402][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:02:21,122][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:02:21,840][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:02:22,559][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:02:23,279][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:02:23,998][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:02:24,719][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:02:25,437][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:02:26,157][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:02:26,875][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:02:27,595][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:02:28,316][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:02:29,034][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:02:29,757][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:02:30,474][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:02:31,196][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:02:31,916][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:02:32,635][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:02:33,357][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:02:34,078][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:02:35,033][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:02:35,755][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:02:36,475][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:02:37,195][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:02:37,915][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:02:38,637][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:02:39,359][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:02:40,079][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:02:40,800][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:02:41,524][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:02:42,246][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:02:42,966][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:02:43,687][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:02:44,408][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:02:45,127][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:02:45,849][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:02:46,570][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:02:47,311][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 22:02:48,413][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:02:48,417][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:02:48,419][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:02:49,826][__main__][INFO] - Iteration 477 took 59s (15.21% Gen, 82.43% Train). Generation: 9s, Training: 49s. Estimated remaining time: 8h 58m 58s. Estimated total time: 16h 37m 14s. Time estimates for 10 more iterations: 9m 58s, 100 more iterations: 1h 39m 43s, 500 more iterations: 8h 18m 37s. [2026-03-25 22:02:49,828][__main__][INFO] - Starting iteration 477. [2026-03-25 22:02:49,833][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 22:02:49,833][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:02:57,741][__main__][INFO] - Number of regex retries in iteration 477: 0 [2026-03-25 22:02:57,742][__main__][INFO] - agents played in iteration 477 are Bob, Alice [2026-03-25 22:02:58,239][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:02:58,303][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:02:58,304][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:02:58,305][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:02:59,017][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:02:59,665][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:03:00,385][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:03:01,103][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:03:01,822][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:03:02,541][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:03:03,260][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:03:03,977][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:03:04,695][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:03:05,415][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:03:06,133][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:03:06,853][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:03:07,571][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:03:08,291][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:03:09,011][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:03:09,731][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:03:10,451][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:03:11,171][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:03:11,890][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:03:12,611][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:03:13,332][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:03:14,052][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:03:14,772][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:03:15,492][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:03:16,213][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:03:16,934][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:03:17,653][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:03:18,374][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:03:19,093][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:03:19,814][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:03:20,534][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:03:21,254][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:03:21,975][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:03:22,695][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:03:23,415][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:03:24,137][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:03:24,859][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:03:25,580][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:03:26,300][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:03:27,022][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:03:27,742][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:03:28,463][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:03:29,183][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:03:29,905][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:03:30,624][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:03:31,347][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:03:32,068][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:03:32,789][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:03:33,817][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:03:34,539][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:03:35,259][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:03:35,980][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:03:36,702][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:03:37,423][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:03:38,144][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:03:38,866][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:03:39,588][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:03:40,309][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:03:41,031][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:03:41,754][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:03:42,474][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:03:43,197][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:03:43,918][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:03:44,641][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:03:45,362][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:03:46,137][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 22:03:47,206][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:03:47,209][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:03:47,211][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:03:48,497][__main__][INFO] - Iteration 478 took 58s (13.48% Gen, 84.32% Train). Generation: 7s, Training: 49s. Estimated remaining time: 8h 38m 33s. Estimated total time: 16h 17m 47s. Time estimates for 10 more iterations: 9m 46s, 100 more iterations: 1h 37m 46s, 500 more iterations: 8h 8m 53s. [2026-03-25 22:03:48,500][__main__][INFO] - Starting iteration 478. [2026-03-25 22:03:48,504][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 22:03:48,505][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:03:54,000][__main__][INFO] - Number of regex retries in iteration 478: 0 [2026-03-25 22:03:54,002][__main__][INFO] - agents played in iteration 478 are Bob, Alice [2026-03-25 22:03:54,749][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:03:54,816][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:03:54,817][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:03:54,817][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:03:55,553][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:03:56,202][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:03:56,921][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:03:57,638][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:03:58,357][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:03:59,075][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:03:59,796][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:04:00,515][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:04:01,235][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:04:01,955][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:04:02,675][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:04:03,394][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:04:04,114][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:04:04,834][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:04:05,553][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:04:06,273][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:04:06,994][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:04:07,713][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:04:08,434][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:04:09,155][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:04:09,876][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:04:10,596][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:04:11,318][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:04:12,039][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:04:12,760][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:04:13,480][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:04:14,201][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:04:14,923][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:04:15,643][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:04:16,363][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:04:17,084][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:04:17,805][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:04:18,526][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:04:19,248][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:04:19,969][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:04:20,691][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:04:21,411][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:04:22,131][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:04:22,853][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:04:23,575][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:04:24,297][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:04:25,018][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:04:25,740][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:04:26,462][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:04:27,185][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:04:27,906][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:04:28,629][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:04:29,351][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:04:30,320][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:04:31,042][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:04:31,763][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:04:32,486][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:04:33,209][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:04:33,930][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:04:34,650][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:04:35,373][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:04:36,095][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:04:36,816][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:04:37,539][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:04:38,262][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:04:38,983][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:04:39,705][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:04:40,428][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:04:41,151][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:04:41,875][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:04:42,632][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 22:04:43,792][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:04:43,796][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:04:43,798][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:04:45,370][__main__][INFO] - Iteration 479 took 56s (9.67% Gen, 87.56% Train). Generation: 5s, Training: 49s. Estimated remaining time: 8h 7m 37s. Estimated total time: 15h 47m 48s. Time estimates for 10 more iterations: 9m 28s, 100 more iterations: 1h 34m 46s, 500 more iterations: 7h 53m 54s. [2026-03-25 22:04:45,373][__main__][INFO] - Starting iteration 479. [2026-03-25 22:04:45,377][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 22:04:45,377][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:04:50,312][__main__][INFO] - Number of regex retries in iteration 479: 0 [2026-03-25 22:04:50,313][__main__][INFO] - agents played in iteration 479 are Bob, Alice [2026-03-25 22:04:50,829][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:04:50,894][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:04:50,895][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:04:50,896][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:04:51,623][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:04:52,273][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:04:52,996][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:04:53,714][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:04:54,434][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:04:55,152][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:04:55,872][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:04:56,589][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:04:57,310][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:04:58,031][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:04:58,752][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:04:59,472][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:05:00,192][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:05:00,913][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:05:01,632][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:05:02,352][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:05:03,074][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:05:03,796][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:05:04,517][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:05:05,239][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:05:05,960][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:05:06,682][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:05:07,402][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:05:08,124][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:05:08,846][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:05:09,569][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:05:10,290][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:05:11,010][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:05:11,731][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:05:12,452][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:05:13,171][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:05:13,894][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:05:14,614][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:05:15,337][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:05:16,056][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:05:16,777][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:05:17,501][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:05:18,224][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:05:18,946][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:05:19,668][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:05:20,391][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:05:21,114][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:05:21,837][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:05:22,562][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:05:23,283][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:05:24,006][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:05:24,729][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:05:25,453][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:05:26,414][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:05:27,137][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:05:27,860][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:05:28,583][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:05:29,306][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:05:30,029][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:05:30,753][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:05:31,474][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:05:32,199][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:05:32,922][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:05:33,645][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:05:34,370][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:05:35,091][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:05:35,813][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:05:36,535][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:05:37,260][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:05:37,985][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:05:38,747][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 22:05:39,820][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:05:39,823][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:05:39,824][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:05:41,194][__main__][INFO] - Iteration 480 took 55s (8.84% Gen, 88.70% Train). Generation: 4s, Training: 49s. Estimated remaining time: 7h 49m 12s. Estimated total time: 15h 30m 19s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 1s, 500 more iterations: 7h 45m 9s. [2026-03-25 22:05:41,196][__main__][INFO] - Starting iteration 480. [2026-03-25 22:05:41,200][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 22:05:41,201][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:05:46,086][__main__][INFO] - Number of regex retries in iteration 480: 0 [2026-03-25 22:05:46,087][__main__][INFO] - agents played in iteration 480 are Bob, Alice [2026-03-25 22:05:46,622][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:05:46,688][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:05:46,690][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:05:46,691][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:05:47,401][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:05:48,050][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:05:48,771][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:05:49,489][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:05:50,209][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:05:50,927][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:05:51,646][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:05:52,365][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:05:53,085][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:05:53,804][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:05:54,524][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:05:55,243][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:05:55,963][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:05:56,682][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:05:57,403][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:05:58,124][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:05:58,843][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:05:59,562][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:06:00,282][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:06:01,001][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:06:01,723][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:06:02,445][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:06:03,164][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:06:03,884][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:06:04,606][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:06:05,326][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:06:06,046][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:06:06,767][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:06:07,489][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:06:08,207][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:06:08,929][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:06:09,652][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:06:10,372][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:06:11,093][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:06:11,814][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:06:12,536][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:06:13,257][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:06:13,978][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:06:14,698][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:06:15,420][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:06:16,144][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:06:16,864][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:06:17,586][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:06:18,308][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:06:19,030][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:06:19,751][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:06:20,472][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:06:21,194][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:06:22,211][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:06:22,933][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:06:23,654][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:06:24,376][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:06:25,098][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:06:25,819][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:06:26,541][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:06:27,263][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:06:27,986][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:06:28,708][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:06:29,430][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:06:30,152][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:06:30,875][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:06:31,598][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:06:32,320][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:06:33,042][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:06:33,765][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:06:34,513][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 22:06:35,696][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:06:35,699][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:06:35,700][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:06:37,055][__main__][INFO] - Iteration 481 took 55s (8.75% Gen, 88.82% Train). Generation: 4s, Training: 49s. Estimated remaining time: 7h 48m 54s. Estimated total time: 15h 30m 57s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 5s, 500 more iterations: 7h 45m 28s. [2026-03-25 22:06:37,059][__main__][INFO] - Starting iteration 481. [2026-03-25 22:06:37,064][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 22:06:37,065][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:06:42,152][__main__][INFO] - Number of regex retries in iteration 481: 0 [2026-03-25 22:06:42,153][__main__][INFO] - agents played in iteration 481 are Bob, Alice [2026-03-25 22:06:42,765][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:06:42,830][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:06:42,831][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:06:42,831][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:06:43,518][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:06:44,169][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:06:44,889][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:06:45,608][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:06:46,327][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:06:47,046][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:06:47,766][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:06:48,485][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:06:49,206][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:06:49,924][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:06:50,645][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:06:51,367][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:06:52,086][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:06:52,807][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:06:53,527][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:06:54,246][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:06:54,968][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:06:55,686][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:06:56,407][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:06:57,127][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:06:57,847][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:06:58,568][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:06:59,288][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:07:00,009][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:07:00,730][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:07:01,450][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:07:02,171][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:07:02,894][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:07:03,614][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:07:04,335][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:07:05,056][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:07:05,778][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:07:06,499][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:07:07,220][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:07:07,941][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:07:08,664][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:07:09,386][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:07:10,108][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:07:10,829][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:07:11,552][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:07:12,274][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:07:12,994][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:07:13,716][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:07:14,439][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:07:15,160][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:07:15,884][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:07:16,606][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:07:17,327][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:07:18,298][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:07:19,020][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:07:19,742][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:07:20,464][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:07:21,188][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:07:21,910][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:07:22,633][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:07:23,356][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:07:24,080][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:07:24,804][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:07:25,526][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:07:26,250][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:07:26,971][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:07:27,693][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:07:28,417][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:07:29,141][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:07:29,864][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:07:30,593][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 22:07:31,898][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:07:31,902][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:07:31,904][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:07:33,341][__main__][INFO] - Iteration 482 took 56s (9.04% Gen, 88.40% Train). Generation: 5s, Training: 49s. Estimated remaining time: 7h 55m 0s. Estimated total time: 15h 37m 59s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 47s, 500 more iterations: 7h 48m 59s. [2026-03-25 22:07:33,344][__main__][INFO] - Starting iteration 482. [2026-03-25 22:07:33,348][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 22:07:33,348][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:07:38,240][__main__][INFO] - Number of regex retries in iteration 482: 0 [2026-03-25 22:07:38,241][__main__][INFO] - agents played in iteration 482 are Bob, Alice [2026-03-25 22:07:38,797][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:07:38,865][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:07:38,865][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:07:38,866][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:07:39,565][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:07:40,215][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:07:40,938][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:07:41,656][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:07:42,377][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:07:43,098][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:07:43,816][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:07:44,536][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:07:45,257][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:07:45,977][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:07:46,697][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:07:47,419][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:07:48,138][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:07:48,859][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:07:49,581][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:07:50,302][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:07:51,022][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:07:51,744][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:07:52,465][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:07:53,185][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:07:53,907][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:07:54,629][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:07:55,351][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:07:56,072][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:07:56,795][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:07:57,517][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:07:58,240][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:07:58,961][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:07:59,682][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:08:00,405][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:08:01,126][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:08:01,848][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:08:02,571][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:08:03,293][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:08:04,014][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:08:04,738][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:08:05,460][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:08:06,183][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:08:06,906][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:08:07,630][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:08:08,351][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:08:09,077][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:08:09,800][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:08:10,522][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:08:11,247][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:08:11,970][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:08:12,693][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:08:13,416][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:08:14,367][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:08:15,089][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:08:15,810][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:08:16,533][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:08:17,257][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:08:17,981][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:08:18,703][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:08:19,424][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:08:20,150][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:08:20,871][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:08:21,595][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:08:22,319][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:08:23,042][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:08:23,763][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:08:24,487][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:08:25,210][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:08:25,934][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:08:26,666][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 22:08:27,704][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:08:27,707][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:08:27,708][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:08:29,168][__main__][INFO] - Iteration 483 took 55s (8.77% Gen, 88.61% Train). Generation: 4s, Training: 49s. Estimated remaining time: 7h 46m 27s. Estimated total time: 15h 30m 22s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 2s, 500 more iterations: 7h 45m 11s. [2026-03-25 22:08:29,171][__main__][INFO] - Starting iteration 483. [2026-03-25 22:08:29,175][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 22:08:29,176][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:08:34,389][__main__][INFO] - Number of regex retries in iteration 483: 0 [2026-03-25 22:08:34,390][__main__][INFO] - agents played in iteration 483 are Bob, Alice [2026-03-25 22:08:35,155][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:08:35,220][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:08:35,221][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:08:35,222][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:08:35,917][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:08:36,565][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:08:37,288][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:08:38,008][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:08:38,728][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:08:39,448][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:08:40,169][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:08:40,888][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:08:41,609][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:08:42,330][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:08:43,050][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:08:43,770][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:08:44,493][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:08:45,213][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:08:45,933][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:08:46,655][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:08:47,377][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:08:48,098][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:08:48,819][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:08:49,540][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:08:50,261][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:08:50,983][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:08:51,704][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:08:52,426][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:08:53,148][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:08:53,871][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:08:54,592][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:08:55,314][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:08:56,036][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:08:56,758][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:08:57,481][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:08:58,203][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:08:58,925][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:08:59,650][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:09:00,373][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:09:01,095][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:09:01,818][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:09:02,541][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:09:03,265][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:09:03,987][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:09:04,711][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:09:05,433][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:09:06,156][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:09:06,881][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:09:07,604][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:09:08,326][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:09:09,051][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:09:09,774][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:09:10,746][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:09:11,472][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:09:12,196][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:09:12,920][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:09:13,643][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:09:14,366][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:09:15,090][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:09:15,815][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:09:16,537][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:09:17,261][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:09:17,983][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:09:18,708][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:09:19,433][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:09:20,159][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:09:20,884][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:09:21,608][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:09:22,331][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:09:23,100][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 22:09:24,379][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:09:24,383][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:09:24,385][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:09:25,798][__main__][INFO] - Iteration 484 took 56s (9.21% Gen, 88.29% Train). Generation: 5s, Training: 49s. Estimated remaining time: 7h 58m 54s. Estimated total time: 15h 43m 45s. Time estimates for 10 more iterations: 9m 26s, 100 more iterations: 1h 34m 22s, 500 more iterations: 7h 51m 52s. [2026-03-25 22:09:25,802][__main__][INFO] - Starting iteration 484. [2026-03-25 22:09:25,808][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 22:09:25,809][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:09:30,928][__main__][INFO] - Number of regex retries in iteration 484: 0 [2026-03-25 22:09:30,929][__main__][INFO] - agents played in iteration 484 are Bob, Alice [2026-03-25 22:09:31,429][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:09:31,493][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:09:31,494][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:09:31,495][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:09:32,187][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:09:32,837][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:09:33,559][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:09:34,279][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:09:35,000][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:09:35,720][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:09:36,441][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:09:37,163][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:09:37,886][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:09:38,605][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:09:39,328][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:09:40,050][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:09:40,771][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:09:41,493][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:09:42,213][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:09:42,936][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:09:43,659][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:09:44,381][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:09:45,104][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:09:45,826][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:09:46,548][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:09:47,271][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:09:47,995][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:09:48,718][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:09:49,441][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:09:50,163][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:09:50,885][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:09:51,609][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:09:52,332][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:09:53,055][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:09:53,777][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:09:54,499][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:09:55,223][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:09:55,946][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:09:56,671][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:09:57,394][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:09:58,116][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:09:58,840][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:09:59,564][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:10:00,287][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:10:01,012][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:10:01,736][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:10:02,459][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:10:03,182][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:10:03,906][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:10:04,629][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:10:05,352][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:10:06,076][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:10:07,057][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:10:07,782][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:10:08,505][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:10:09,231][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:10:09,956][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:10:10,681][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:10:11,404][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:10:12,127][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:10:12,850][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:10:13,575][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:10:14,299][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:10:15,024][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:10:15,748][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:10:16,473][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:10:17,196][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:10:17,919][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:10:18,646][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:10:19,364][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 22:10:20,814][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:10:20,819][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:10:20,821][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:10:22,308][__main__][INFO] - Iteration 485 took 56s (9.06% Gen, 88.30% Train). Generation: 5s, Training: 49s. Estimated remaining time: 7h 55m 55s. Estimated total time: 15h 41m 43s. Time estimates for 10 more iterations: 9m 25s, 100 more iterations: 1h 34m 10s, 500 more iterations: 7h 50m 51s. [2026-03-25 22:10:22,311][__main__][INFO] - Starting iteration 485. [2026-03-25 22:10:22,316][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 22:10:22,317][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:10:27,225][__main__][INFO] - Number of regex retries in iteration 485: 0 [2026-03-25 22:10:27,226][__main__][INFO] - agents played in iteration 485 are Bob, Alice [2026-03-25 22:10:27,728][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:10:27,793][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:10:27,794][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:10:27,795][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:10:28,481][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:10:29,131][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:10:29,854][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:10:30,575][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:10:31,296][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:10:32,018][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:10:32,738][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:10:33,458][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:10:34,181][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:10:34,902][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:10:35,624][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:10:36,345][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:10:37,068][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:10:37,790][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:10:38,512][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:10:39,235][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:10:39,957][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:10:40,679][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:10:41,403][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:10:42,126][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:10:42,848][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:10:43,570][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:10:44,293][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:10:45,019][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:10:45,742][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:10:46,466][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:10:47,193][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:10:47,917][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:10:48,641][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:10:49,364][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:10:50,085][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:10:50,808][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:10:51,531][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:10:52,255][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:10:52,979][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:10:53,703][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:10:54,428][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:10:55,151][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:10:55,874][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:10:56,597][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:10:57,320][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:10:58,045][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:10:58,770][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:10:59,494][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:11:00,219][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:11:00,943][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:11:01,666][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:11:02,390][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:11:03,343][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:11:04,068][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:11:04,790][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:11:05,514][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:11:06,238][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:11:06,962][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:11:07,687][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:11:08,412][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:11:09,138][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:11:09,863][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:11:10,588][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:11:11,313][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:11:12,035][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:11:12,760][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:11:13,485][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:11:14,209][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:11:14,933][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:11:15,658][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 22:11:16,895][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:11:16,899][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:11:16,902][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:11:18,291][__main__][INFO] - Iteration 486 took 55s (8.77% Gen, 88.75% Train). Generation: 4s, Training: 49s. Estimated remaining time: 7h 46m 11s. Estimated total time: 15h 32m 55s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 17s, 500 more iterations: 7h 46m 27s. [2026-03-25 22:11:18,294][__main__][INFO] - Starting iteration 486. [2026-03-25 22:11:18,299][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 22:11:18,300][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:11:23,254][__main__][INFO] - Number of regex retries in iteration 486: 0 [2026-03-25 22:11:23,255][__main__][INFO] - agents played in iteration 486 are Bob, Alice [2026-03-25 22:11:23,754][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:11:23,820][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:11:23,821][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:11:23,821][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:11:24,506][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:11:25,157][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:11:25,882][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:11:26,608][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:11:27,328][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:11:28,049][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:11:28,772][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:11:29,492][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:11:30,215][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:11:30,936][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:11:31,660][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:11:32,380][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:11:33,101][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:11:33,824][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:11:34,546][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:11:35,269][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:11:35,989][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:11:36,711][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:11:37,434][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:11:38,156][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:11:38,878][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:11:39,601][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:11:40,324][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:11:41,047][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:11:41,772][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:11:42,496][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:11:43,217][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:11:43,940][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:11:44,662][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:11:45,386][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:11:46,110][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:11:46,834][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:11:47,558][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:11:48,280][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:11:49,003][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:11:49,726][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:11:50,450][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:11:51,174][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:11:51,898][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:11:52,623][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:11:53,348][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:11:54,070][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:11:54,793][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:11:55,516][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:11:56,241][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:11:56,965][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:11:57,690][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:11:58,414][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:11:59,368][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:12:00,093][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:12:00,817][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:12:01,543][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:12:02,267][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:12:02,990][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:12:03,713][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:12:04,438][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:12:05,162][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:12:05,886][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:12:06,610][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:12:07,335][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:12:08,061][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:12:08,786][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:12:09,512][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:12:10,236][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:12:10,959][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:12:11,713][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 22:12:12,833][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:12:12,836][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:12:12,838][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:12:14,405][__main__][INFO] - Iteration 487 took 56s (8.83% Gen, 88.37% Train). Generation: 4s, Training: 49s. Estimated remaining time: 7h 47m 27s. Estimated total time: 15h 35m 7s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 30s, 500 more iterations: 7h 47m 33s. [2026-03-25 22:12:14,409][__main__][INFO] - Starting iteration 487. [2026-03-25 22:12:14,416][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 22:12:14,417][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:12:23,225][__main__][INFO] - Number of regex retries in iteration 487: 0 [2026-03-25 22:12:23,226][__main__][INFO] - agents played in iteration 487 are Bob, Alice [2026-03-25 22:12:23,726][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:12:23,791][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:12:23,792][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:12:23,793][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:12:24,480][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:12:25,128][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:12:25,850][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:12:26,569][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:12:27,288][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:12:28,006][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:12:28,724][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:12:29,446][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:12:30,165][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:12:30,885][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:12:31,605][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:12:32,324][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:12:33,044][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:12:33,762][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:12:34,481][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:12:35,201][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:12:35,922][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:12:36,642][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:12:37,362][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:12:38,084][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:12:38,804][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:12:39,525][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:12:40,248][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:12:40,967][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:12:41,688][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:12:42,408][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:12:43,130][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:12:43,850][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:12:44,572][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:12:45,293][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:12:46,013][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:12:46,734][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:12:47,455][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:12:48,176][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:12:48,898][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:12:49,619][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:12:50,339][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:12:51,061][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:12:51,782][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:12:52,503][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:12:53,224][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:12:53,947][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:12:54,668][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:12:55,388][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:12:56,111][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:12:56,832][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:12:57,554][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:12:58,275][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:12:59,277][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:13:00,001][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:13:00,722][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:13:01,445][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:13:02,165][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:13:02,888][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:13:03,610][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:13:04,332][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:13:05,053][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:13:05,776][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:13:06,499][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:13:07,222][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:13:07,945][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:13:08,668][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:13:09,391][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:13:10,113][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:13:10,837][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:13:11,577][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 22:13:12,888][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:13:12,892][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:13:12,893][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:13:14,301][__main__][INFO] - Iteration 488 took 59s (14.71% Gen, 82.93% Train). Generation: 8s, Training: 49s. Estimated remaining time: 8h 49m 28s. Estimated total time: 16h 38m 8s. Time estimates for 10 more iterations: 9m 58s, 100 more iterations: 1h 39m 48s, 500 more iterations: 8h 19m 4s. [2026-03-25 22:13:14,305][__main__][INFO] - Starting iteration 488. [2026-03-25 22:13:14,312][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 22:13:14,314][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:13:19,256][__main__][INFO] - Number of regex retries in iteration 488: 0 [2026-03-25 22:13:19,257][__main__][INFO] - agents played in iteration 488 are Bob, Alice [2026-03-25 22:13:19,788][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:13:19,855][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:13:19,856][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:13:19,857][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:13:20,554][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:13:21,203][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:13:21,923][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:13:22,642][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:13:23,363][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:13:24,081][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:13:24,801][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:13:25,521][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:13:26,240][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:13:26,961][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:13:27,680][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:13:28,400][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:13:29,121][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:13:29,840][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:13:30,560][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:13:31,282][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:13:32,001][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:13:32,722][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:13:33,443][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:13:34,164][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:13:34,885][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:13:35,605][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:13:36,328][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:13:37,047][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:13:37,769][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:13:38,489][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:13:39,210][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:13:39,932][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:13:40,654][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:13:41,375][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:13:42,097][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:13:42,818][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:13:43,537][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:13:44,259][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:13:44,980][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:13:45,701][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:13:46,422][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:13:47,143][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:13:47,865][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:13:48,585][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:13:49,307][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:13:50,029][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:13:50,751][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:13:51,473][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:13:52,196][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:13:52,918][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:13:53,642][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:13:54,367][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:13:55,332][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:13:56,057][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:13:56,779][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:13:57,502][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:13:58,224][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:13:58,948][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:13:59,673][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:14:00,398][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:14:01,122][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:14:01,846][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:14:02,569][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:14:03,293][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:14:04,016][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:14:04,738][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:14:05,462][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:14:06,186][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:14:06,907][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:14:07,637][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 22:14:08,923][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:14:08,927][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:14:08,928][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:14:10,385][__main__][INFO] - Iteration 489 took 56s (8.81% Gen, 88.58% Train). Generation: 4s, Training: 49s. Estimated remaining time: 7h 44m 59s. Estimated total time: 15h 34m 36s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 27s, 500 more iterations: 7h 47m 18s. [2026-03-25 22:14:10,388][__main__][INFO] - Starting iteration 489. [2026-03-25 22:14:10,394][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 22:14:10,396][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:14:15,282][__main__][INFO] - Number of regex retries in iteration 489: 0 [2026-03-25 22:14:15,284][__main__][INFO] - agents played in iteration 489 are Bob, Alice [2026-03-25 22:14:15,871][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:14:15,937][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:14:15,938][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:14:15,939][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:14:16,711][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:14:17,363][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:14:18,087][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:14:18,808][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:14:19,528][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:14:20,248][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:14:20,968][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:14:21,689][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:14:22,409][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:14:23,128][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:14:23,849][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:14:24,571][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:14:25,289][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:14:26,010][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:14:26,731][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:14:27,450][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:14:28,171][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:14:28,892][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:14:29,612][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:14:30,332][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:14:31,054][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:14:31,774][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:14:32,496][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:14:33,216][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:14:33,937][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:14:34,657][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:14:35,377][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:14:36,099][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:14:36,821][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:14:37,541][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:14:38,263][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:14:38,985][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:14:39,706][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:14:40,426][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:14:41,149][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:14:41,869][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:14:42,590][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:14:43,312][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:14:44,034][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:14:44,756][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:14:45,476][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:14:46,198][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:14:46,923][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:14:47,645][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:14:48,368][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:14:49,093][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:14:49,813][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:14:50,536][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:14:51,500][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:14:52,224][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:14:52,946][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:14:53,670][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:14:54,393][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:14:55,114][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:14:55,836][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:14:56,559][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:14:57,281][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:14:58,005][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:14:58,728][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:14:59,450][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:15:00,172][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:15:00,895][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:15:01,619][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:15:02,342][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:15:03,065][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:15:03,854][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 22:15:05,040][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:15:05,044][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:15:05,045][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:15:07,180][__main__][INFO] - Iteration 490 took 56s (8.61% Gen, 87.63% Train). Generation: 4s, Training: 49s. Estimated remaining time: 7h 55m 55s. Estimated total time: 15h 46m 27s. Time estimates for 10 more iterations: 9m 27s, 100 more iterations: 1h 34m 38s, 500 more iterations: 7h 53m 13s. [2026-03-25 22:15:07,183][__main__][INFO] - Starting iteration 490. [2026-03-25 22:15:07,195][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 22:15:07,196][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:15:12,183][__main__][INFO] - Number of regex retries in iteration 490: 0 [2026-03-25 22:15:12,185][__main__][INFO] - agents played in iteration 490 are Bob, Alice [2026-03-25 22:15:12,700][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:15:12,766][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:15:12,767][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:15:12,768][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:15:13,492][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:15:14,147][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:15:14,867][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:15:15,586][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:15:16,306][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:15:17,024][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:15:17,742][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:15:18,462][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:15:19,180][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:15:19,900][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:15:20,620][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:15:21,338][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:15:22,057][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:15:22,777][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:15:23,497][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:15:24,217][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:15:24,937][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:15:25,655][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:15:26,376][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:15:27,097][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:15:27,816][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:15:28,536][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:15:29,257][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:15:29,976][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:15:30,697][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:15:31,418][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:15:32,139][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:15:32,860][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:15:33,580][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:15:34,301][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:15:35,023][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:15:35,744][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:15:36,465][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:15:37,185][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:15:37,906][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:15:38,629][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:15:39,351][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:15:40,072][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:15:40,793][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:15:41,513][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:15:42,235][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:15:42,957][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:15:43,676][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:15:44,397][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:15:45,120][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:15:45,840][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:15:46,563][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:15:47,283][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:15:48,286][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:15:49,007][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:15:49,727][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:15:50,451][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:15:51,171][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:15:51,892][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:15:52,615][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:15:53,335][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:15:54,057][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:15:54,779][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:15:55,500][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:15:56,222][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:15:56,946][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:15:57,665][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:15:58,387][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:15:59,111][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:15:59,832][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:16:00,569][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 22:16:01,686][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:16:01,689][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:16:01,690][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:16:03,090][__main__][INFO] - Iteration 491 took 55s (8.92% Gen, 88.56% Train). Generation: 4s, Training: 49s. Estimated remaining time: 7h 40m 8s. Estimated total time: 15h 31m 37s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 9s, 500 more iterations: 7h 45m 48s. [2026-03-25 22:16:03,094][__main__][INFO] - Starting iteration 491. [2026-03-25 22:16:03,098][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 22:16:03,099][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:16:08,523][__main__][INFO] - Number of regex retries in iteration 491: 0 [2026-03-25 22:16:08,525][__main__][INFO] - agents played in iteration 491 are Bob, Alice [2026-03-25 22:16:09,373][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:16:09,439][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:16:09,440][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:16:09,440][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:16:10,142][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:16:10,791][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:16:11,512][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:16:12,229][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:16:12,947][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:16:13,665][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:16:14,383][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:16:15,102][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:16:15,820][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:16:16,539][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:16:17,259][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:16:17,977][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:16:18,696][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:16:19,414][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:16:20,133][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:16:20,854][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:16:21,573][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:16:22,292][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:16:23,012][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:16:23,731][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:16:24,451][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:16:25,172][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:16:25,892][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:16:26,609][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:16:27,331][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:16:28,051][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:16:28,771][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:16:29,493][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:16:30,212][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:16:30,932][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:16:31,653][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:16:32,373][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:16:33,093][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:16:33,813][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:16:34,535][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:16:35,254][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:16:35,975][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:16:36,697][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:16:37,418][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:16:38,138][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:16:38,860][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:16:39,582][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:16:40,304][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:16:41,023][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:16:41,743][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:16:42,465][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:16:43,184][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:16:43,905][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:16:44,863][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:16:45,585][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:16:46,306][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:16:47,029][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:16:47,749][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:16:48,469][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:16:49,190][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:16:49,911][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:16:50,632][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:16:51,351][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:16:52,074][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:16:52,796][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:16:53,518][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:16:54,240][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:16:54,962][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:16:55,777][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:16:56,499][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:16:57,243][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 22:16:58,509][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:16:58,514][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:16:58,516][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:17:00,081][__main__][INFO] - Iteration 492 took 56s (9.52% Gen, 87.73% Train). Generation: 5s, Training: 49s. Estimated remaining time: 7h 57m 19s. Estimated total time: 15h 49m 44s. Time estimates for 10 more iterations: 9m 29s, 100 more iterations: 1h 34m 58s, 500 more iterations: 7h 54m 52s. [2026-03-25 22:17:00,086][__main__][INFO] - Starting iteration 492. [2026-03-25 22:17:00,091][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 22:17:00,092][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:17:04,981][__main__][INFO] - Number of regex retries in iteration 492: 0 [2026-03-25 22:17:04,982][__main__][INFO] - agents played in iteration 492 are Bob, Alice [2026-03-25 22:17:05,488][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:17:05,555][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:17:05,556][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:17:05,557][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:17:06,243][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:17:06,892][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:17:07,612][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:17:08,330][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:17:09,048][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:17:09,767][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:17:10,486][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:17:11,204][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:17:11,923][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:17:12,642][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:17:13,359][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:17:14,083][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:17:14,803][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:17:15,521][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:17:16,241][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:17:16,959][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:17:17,678][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:17:18,398][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:17:19,117][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:17:19,835][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:17:20,556][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:17:21,275][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:17:21,993][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:17:22,713][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:17:23,432][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:17:24,152][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:17:24,873][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:17:25,591][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:17:26,311][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:17:27,032][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:17:27,751][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:17:28,471][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:17:29,192][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:17:29,912][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:17:30,630][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:17:31,350][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:17:32,070][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:17:32,789][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:17:33,510][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:17:34,230][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:17:34,949][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:17:35,669][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:17:36,391][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:17:37,110][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:17:37,829][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:17:38,551][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:17:39,272][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:17:39,992][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:17:40,949][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:17:41,669][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:17:42,389][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:17:43,109][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:17:43,831][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:17:44,553][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:17:45,273][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:17:45,994][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:17:46,715][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:17:47,435][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:17:48,155][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:17:48,878][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:17:49,599][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:17:50,320][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:17:51,040][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:17:51,762][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:17:52,482][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:17:53,233][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 22:17:54,441][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:17:54,445][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:17:54,447][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:17:55,850][__main__][INFO] - Iteration 493 took 55s (8.77% Gen, 88.71% Train). Generation: 4s, Training: 49s. Estimated remaining time: 7h 35m 58s. Estimated total time: 15h 29m 20s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 56s, 500 more iterations: 7h 44m 40s. [2026-03-25 22:17:55,858][__main__][INFO] - Starting iteration 493. [2026-03-25 22:17:55,871][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 22:17:55,872][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:18:00,811][__main__][INFO] - Number of regex retries in iteration 493: 0 [2026-03-25 22:18:00,812][__main__][INFO] - agents played in iteration 493 are Bob, Alice [2026-03-25 22:18:01,307][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:18:01,371][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:18:01,372][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:18:01,372][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:18:02,056][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:18:02,705][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:18:03,425][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:18:04,143][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:18:04,861][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:18:05,579][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:18:06,298][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:18:07,016][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:18:07,735][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:18:08,453][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:18:09,174][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:18:09,893][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:18:10,612][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:18:11,329][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:18:12,050][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:18:12,766][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:18:13,487][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:18:14,205][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:18:14,923][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:18:15,642][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:18:16,361][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:18:17,082][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:18:17,802][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:18:18,520][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:18:19,241][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:18:19,961][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:18:20,680][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:18:21,400][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:18:22,120][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:18:22,838][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:18:23,558][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:18:24,279][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:18:24,997][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:18:25,717][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:18:26,438][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:18:27,157][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:18:27,877][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:18:28,600][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:18:29,320][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:18:30,039][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:18:30,759][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:18:31,481][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:18:32,201][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:18:32,922][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:18:33,641][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:18:34,362][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:18:35,082][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:18:35,802][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:18:36,840][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:18:37,562][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:18:38,280][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:18:39,003][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:18:39,724][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:18:40,443][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:18:41,165][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:18:41,886][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:18:42,607][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:18:43,328][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:18:44,049][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:18:44,770][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:18:45,490][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:18:46,211][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:18:46,934][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:18:47,655][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:18:48,375][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:18:49,107][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 22:18:50,201][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:18:50,206][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:18:50,207][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:18:51,572][__main__][INFO] - Iteration 494 took 55s (8.87% Gen, 88.68% Train). Generation: 4s, Training: 49s. Estimated remaining time: 7h 34m 5s. Estimated total time: 15h 28m 23s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 50s, 500 more iterations: 7h 44m 11s. [2026-03-25 22:18:51,576][__main__][INFO] - Starting iteration 494. [2026-03-25 22:18:51,582][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 22:18:51,584][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:18:56,485][__main__][INFO] - Number of regex retries in iteration 494: 0 [2026-03-25 22:18:56,486][__main__][INFO] - agents played in iteration 494 are Bob, Alice [2026-03-25 22:18:56,985][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:18:57,052][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:18:57,052][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:18:57,053][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:18:57,737][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:18:58,428][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:18:59,147][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:18:59,866][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:19:00,583][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:19:01,303][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:19:02,020][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:19:02,740][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:19:03,460][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:19:04,180][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:19:04,898][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:19:05,617][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:19:06,336][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:19:07,056][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:19:07,775][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:19:08,494][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:19:09,215][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:19:09,934][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:19:10,654][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:19:11,374][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:19:12,092][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:19:12,813][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:19:13,532][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:19:14,252][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:19:14,970][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:19:15,690][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:19:16,412][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:19:17,130][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:19:17,851][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:19:18,571][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:19:19,289][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:19:20,011][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:19:20,729][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:19:21,449][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:19:22,171][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:19:22,889][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:19:23,610][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:19:24,331][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:19:25,049][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:19:25,770][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:19:26,492][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:19:27,211][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:19:27,931][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:19:28,653][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:19:29,372][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:19:30,093][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:19:30,814][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:19:31,536][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:19:32,489][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:19:33,213][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:19:33,933][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:19:34,653][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:19:35,374][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:19:36,096][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:19:36,816][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:19:37,538][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:19:38,260][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:19:38,981][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:19:39,703][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:19:40,423][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:19:41,143][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:19:41,866][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:19:42,588][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:19:43,306][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:19:44,029][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:19:44,760][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 22:19:46,068][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:19:46,071][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:19:46,073][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:19:47,654][__main__][INFO] - Iteration 495 took 56s (8.74% Gen, 88.43% Train). Generation: 4s, Training: 49s. Estimated remaining time: 7h 39m 20s. Estimated total time: 15h 34m 34s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 27s, 500 more iterations: 7h 47m 17s. [2026-03-25 22:19:47,656][__main__][INFO] - Starting iteration 495. [2026-03-25 22:19:47,661][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 22:19:47,662][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:19:53,105][__main__][INFO] - Number of regex retries in iteration 495: 0 [2026-03-25 22:19:53,106][__main__][INFO] - agents played in iteration 495 are Bob, Alice [2026-03-25 22:19:53,634][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:19:53,698][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:19:53,699][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:19:53,700][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:19:54,387][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:19:55,037][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:19:55,755][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:19:56,473][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:19:57,191][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:19:57,909][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:19:58,628][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:19:59,346][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:20:00,064][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:20:00,781][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:20:01,500][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:20:02,218][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:20:02,936][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:20:03,655][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:20:04,374][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:20:05,091][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:20:05,811][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:20:06,528][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:20:07,248][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:20:07,967][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:20:08,685][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:20:09,405][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:20:10,123][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:20:10,842][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:20:11,560][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:20:12,280][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:20:12,999][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:20:13,718][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:20:14,439][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:20:15,158][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:20:15,877][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:20:16,598][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:20:17,317][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:20:18,037][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:20:18,757][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:20:19,477][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:20:20,197][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:20:20,917][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:20:21,637][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:20:22,358][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:20:23,078][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:20:23,798][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:20:24,519][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:20:25,238][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:20:25,960][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:20:26,678][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:20:27,398][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:20:28,120][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:20:29,072][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:20:29,795][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:20:30,514][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:20:31,236][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:20:31,957][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:20:32,676][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:20:33,397][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:20:34,117][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:20:34,837][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:20:35,558][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:20:36,280][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:20:37,003][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:20:37,725][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:20:38,445][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:20:39,168][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:20:39,889][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:20:40,610][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:20:41,351][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 22:20:42,610][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:20:42,614][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:20:42,616][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:20:43,982][__main__][INFO] - Iteration 496 took 56s (9.67% Gen, 87.90% Train). Generation: 5s, Training: 49s. Estimated remaining time: 7h 42m 34s. Estimated total time: 15h 38m 44s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 52s, 500 more iterations: 7h 49m 22s. [2026-03-25 22:20:43,986][__main__][INFO] - Starting iteration 496. [2026-03-25 22:20:43,992][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 22:20:43,993][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:20:49,136][__main__][INFO] - Number of regex retries in iteration 496: 0 [2026-03-25 22:20:49,137][__main__][INFO] - agents played in iteration 496 are Bob, Alice [2026-03-25 22:20:50,126][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:20:50,192][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:20:50,193][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:20:50,194][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:20:50,922][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:20:51,570][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:20:52,289][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:20:53,007][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:20:53,724][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:20:54,444][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:20:55,160][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:20:55,879][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:20:56,596][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:20:57,314][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:20:58,033][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:20:58,753][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:20:59,472][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:21:00,190][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:21:00,909][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:21:01,627][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:21:02,345][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:21:03,064][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:21:03,784][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:21:04,502][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:21:05,222][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:21:05,940][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:21:06,659][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:21:07,378][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:21:08,098][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:21:08,817][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:21:09,537][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:21:10,258][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:21:10,975][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:21:11,696][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:21:12,416][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:21:13,135][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:21:13,854][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:21:14,574][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:21:15,293][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:21:16,014][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:21:16,734][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:21:17,453][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:21:18,175][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:21:18,895][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:21:19,614][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:21:20,335][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:21:21,055][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:21:21,775][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:21:22,495][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:21:23,216][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:21:23,934][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:21:24,657][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:21:25,663][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:21:26,384][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:21:27,104][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:21:27,826][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:21:28,545][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:21:29,267][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:21:29,989][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:21:30,710][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:21:31,431][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:21:32,153][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:21:32,873][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:21:33,595][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:21:34,317][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:21:35,037][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:21:35,758][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:21:36,480][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:21:37,201][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:21:37,934][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 22:21:38,874][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:21:38,877][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:21:38,878][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:21:40,203][__main__][INFO] - Iteration 497 took 56s (9.15% Gen, 88.49% Train). Generation: 5s, Training: 49s. Estimated remaining time: 7h 39m 47s. Estimated total time: 15h 36m 53s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 41s, 500 more iterations: 7h 48m 26s. [2026-03-25 22:21:40,207][__main__][INFO] - Starting iteration 497. [2026-03-25 22:21:40,213][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 22:21:40,214][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:21:45,147][__main__][INFO] - Number of regex retries in iteration 497: 0 [2026-03-25 22:21:45,148][__main__][INFO] - agents played in iteration 497 are Bob, Alice [2026-03-25 22:21:45,662][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:21:45,726][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:21:45,727][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:21:45,728][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:21:46,418][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:21:47,065][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:21:47,785][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:21:48,505][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:21:49,221][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:21:49,939][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:21:50,658][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:21:51,375][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:21:52,095][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:21:52,814][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:21:53,533][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:21:54,252][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:21:54,970][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:21:55,688][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:21:56,408][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:21:57,125][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:21:57,847][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:21:58,566][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:21:59,286][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:22:00,005][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:22:00,725][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:22:01,443][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:22:02,162][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:22:02,883][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:22:03,601][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:22:04,322][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:22:05,042][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:22:05,762][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:22:06,482][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:22:07,202][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:22:07,921][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:22:08,642][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:22:09,364][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:22:10,084][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:22:10,804][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:22:11,523][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:22:12,245][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:22:12,964][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:22:13,685][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:22:14,406][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:22:15,127][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:22:15,847][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:22:16,567][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:22:17,288][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:22:18,009][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:22:18,727][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:22:19,450][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:22:20,172][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:22:21,126][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:22:21,848][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:22:22,568][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:22:23,288][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:22:24,009][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:22:24,730][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:22:25,450][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:22:26,170][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:22:26,893][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:22:27,612][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:22:28,334][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:22:29,055][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:22:29,775][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:22:30,496][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:22:31,217][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:22:31,940][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:22:32,659][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:22:33,386][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 22:22:34,638][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:22:34,642][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:22:34,644][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:22:35,991][__main__][INFO] - Iteration 498 took 55s (8.84% Gen, 88.73% Train). Generation: 4s, Training: 49s. Estimated remaining time: 7h 31m 39s. Estimated total time: 15h 29m 40s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 58s, 500 more iterations: 7h 44m 50s. [2026-03-25 22:22:35,993][__main__][INFO] - Starting iteration 498. [2026-03-25 22:22:35,996][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 22:22:35,997][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:22:40,922][__main__][INFO] - Number of regex retries in iteration 498: 0 [2026-03-25 22:22:40,923][__main__][INFO] - agents played in iteration 498 are Bob, Alice [2026-03-25 22:22:41,425][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:22:41,491][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:22:41,492][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:22:41,492][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:22:42,174][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:22:42,824][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:22:43,543][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:22:44,261][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:22:44,978][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:22:45,696][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:22:46,414][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:22:47,133][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:22:47,851][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:22:48,570][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:22:49,288][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:22:50,007][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:22:50,725][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:22:51,444][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:22:52,163][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:22:52,882][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:22:53,601][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:22:54,320][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:22:55,039][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:22:55,758][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:22:56,477][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:22:57,196][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:22:57,917][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:22:58,634][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:22:59,354][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:23:00,075][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:23:00,792][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:23:01,513][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:23:02,233][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:23:02,952][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:23:03,671][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:23:04,392][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:23:05,111][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:23:05,831][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:23:06,552][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:23:07,271][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:23:07,992][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:23:08,713][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:23:09,435][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:23:10,153][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:23:10,874][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:23:11,594][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:23:12,313][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:23:13,032][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:23:13,753][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:23:14,472][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:23:15,192][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:23:15,914][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:23:16,871][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:23:17,592][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:23:18,312][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:23:19,033][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:23:19,753][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:23:20,474][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:23:21,195][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:23:21,914][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:23:22,635][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:23:23,356][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:23:24,077][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:23:24,798][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:23:25,518][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:23:26,239][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:23:26,959][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:23:27,681][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:23:28,404][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:23:29,163][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 22:23:30,153][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:23:30,156][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:23:30,158][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:23:31,504][__main__][INFO] - Iteration 499 took 55s (8.87% Gen, 88.70% Train). Generation: 4s, Training: 49s. Estimated remaining time: 7h 26m 12s. Estimated total time: 15h 25m 9s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 30s, 500 more iterations: 7h 42m 34s. [2026-03-25 22:23:31,507][__main__][INFO] - Starting iteration 499. [2026-03-25 22:23:31,511][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 22:23:31,513][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:23:36,446][__main__][INFO] - Number of regex retries in iteration 499: 0 [2026-03-25 22:23:36,448][__main__][INFO] - agents played in iteration 499 are Bob, Alice [2026-03-25 22:23:36,946][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:23:37,012][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:23:37,012][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:23:37,013][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:23:37,742][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:23:38,390][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:23:39,110][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:23:39,828][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:23:40,546][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:23:41,264][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:23:41,982][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:23:42,701][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:23:43,419][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:23:44,139][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:23:44,857][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:23:45,578][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:23:46,295][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:23:47,014][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:23:47,732][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:23:48,450][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:23:49,170][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:23:49,889][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:23:50,609][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:23:51,326][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:23:52,047][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:23:52,764][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:23:53,484][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:23:54,204][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:23:54,922][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:23:55,644][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:23:56,362][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:23:57,081][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:23:57,802][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:23:58,521][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:23:59,241][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:23:59,961][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:24:00,682][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:24:01,400][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:24:02,121][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:24:02,842][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:24:03,561][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:24:04,281][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:24:05,001][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:24:05,719][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:24:06,440][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:24:07,160][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:24:07,880][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:24:08,600][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:24:09,321][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:24:10,042][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:24:10,763][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:24:11,482][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:24:12,476][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:24:13,197][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:24:13,917][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:24:14,637][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:24:15,357][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:24:16,078][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:24:16,798][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:24:17,520][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:24:18,240][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:24:18,961][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:24:19,680][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:24:20,401][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:24:21,123][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:24:21,842][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:24:22,563][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:24:23,284][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:24:24,005][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:24:24,733][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 22:24:25,676][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:24:25,679][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:24:25,680][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:24:26,980][__main__][INFO] - Iteration 500 took 55s (8.90% Gen, 88.75% Train). Generation: 4s, Training: 49s. Estimated remaining time: 7h 24m 38s. Estimated total time: 15h 24m 31s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 27s, 500 more iterations: 7h 42m 15s. [2026-03-25 22:24:26,983][__main__][INFO] - Starting iteration 500. [2026-03-25 22:24:26,986][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2026-03-25 22:24:26,987][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:24:31,954][__main__][INFO] - Number of regex retries in iteration 500: 0 [2026-03-25 22:24:31,955][__main__][INFO] - agents played in iteration 500 are Bob, Alice [2026-03-25 22:24:32,457][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:24:32,522][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:24:32,523][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:24:32,524][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:24:33,215][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:24:33,863][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:24:34,583][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:24:35,300][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:24:36,018][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:24:36,736][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:24:37,454][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:24:38,174][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:24:38,892][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:24:39,612][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:24:40,329][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:24:41,049][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:24:41,766][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:24:42,484][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:24:43,204][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:24:43,922][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:24:44,642][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:24:45,360][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:24:46,079][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:24:46,799][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:24:47,517][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:24:48,236][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:24:48,955][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:24:49,675][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:24:50,395][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:24:51,113][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:24:51,834][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:24:52,553][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:24:53,273][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:24:53,992][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:24:54,713][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:24:55,432][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:24:56,152][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:24:56,872][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:24:57,592][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:24:58,310][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:24:59,030][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:24:59,749][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:25:00,470][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:25:01,193][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:25:01,911][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:25:02,631][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:25:03,353][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:25:04,072][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:25:04,792][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:25:05,512][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:25:06,231][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:25:06,953][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:25:07,902][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:25:08,623][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:25:09,346][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:25:10,066][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:25:10,788][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:25:11,508][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:25:12,228][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:25:12,948][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:25:13,667][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:25:14,388][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:25:15,116][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:25:15,847][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:25:16,578][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:25:17,300][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:25:18,021][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:25:18,743][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:25:19,465][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:25:20,267][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 22:25:21,184][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:25:21,186][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:25:21,187][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:25:23,930][__main__][INFO] - Iteration 501 took 56s (8.72% Gen, 86.45% Train). Generation: 4s, Training: 49s. Estimated remaining time: 7h 48m 15s. Estimated total time: 15h 49m 5s. Time estimates for 10 more iterations: 9m 29s, 100 more iterations: 1h 34m 54s, 500 more iterations: 7h 54m 32s. [2026-03-25 22:25:23,951][__main__][INFO] - Starting iteration 501. [2026-03-25 22:25:23,988][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:25:23,989][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:25:29,114][__main__][INFO] - Number of regex retries in iteration 501: 0 [2026-03-25 22:25:29,115][__main__][INFO] - agents played in iteration 501 are Bob, Alice [2026-03-25 22:25:29,921][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:25:29,988][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:25:29,989][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:25:29,989][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:25:30,732][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:25:31,384][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:25:32,102][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:25:32,820][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:25:33,535][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:25:34,254][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:25:34,971][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:25:35,689][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:25:36,407][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:25:37,123][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:25:37,842][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:25:38,559][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:25:39,278][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:25:39,996][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:25:40,715][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:25:41,433][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:25:42,150][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:25:42,869][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:25:43,588][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:25:44,305][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:25:45,024][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:25:45,742][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:25:46,460][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:25:47,177][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:25:47,896][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:25:48,614][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:25:49,332][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:25:50,054][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:25:50,772][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:25:51,492][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:25:52,210][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:25:52,928][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:25:53,649][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:25:54,366][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:25:55,088][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:25:55,809][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:25:56,527][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:25:57,247][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:25:57,966][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:25:58,686][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:25:59,407][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:26:00,125][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:26:00,846][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:26:01,566][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:26:02,285][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:26:03,005][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:26:03,725][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:26:04,446][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:26:05,398][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:26:06,120][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:26:06,839][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:26:07,559][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:26:08,278][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:26:08,999][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:26:09,720][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:26:10,442][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:26:11,162][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:26:11,881][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:26:12,602][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:26:13,322][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:26:14,041][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:26:14,762][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:26:15,485][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:26:16,204][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:26:16,926][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:26:17,680][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 22:26:18,621][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:26:18,623][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:26:18,625][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:26:20,196][__main__][INFO] - Iteration 502 took 56s (9.12% Gen, 88.08% Train). Generation: 5s, Training: 49s. Estimated remaining time: 7h 35m 4s. Estimated total time: 15h 36m 50s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 41s, 500 more iterations: 7h 48m 25s. [2026-03-25 22:26:20,200][__main__][INFO] - Starting iteration 502. [2026-03-25 22:26:20,205][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:26:20,206][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:26:25,179][__main__][INFO] - Number of regex retries in iteration 502: 0 [2026-03-25 22:26:25,181][__main__][INFO] - agents played in iteration 502 are Bob, Alice [2026-03-25 22:26:25,678][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:26:25,746][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:26:25,747][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:26:25,748][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:26:26,451][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:26:27,101][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:26:27,820][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:26:28,538][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:26:29,255][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:26:29,972][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:26:30,690][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:26:31,408][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:26:32,128][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:26:32,844][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:26:33,563][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:26:34,282][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:26:34,998][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:26:35,718][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:26:36,435][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:26:37,154][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:26:37,871][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:26:38,590][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:26:39,311][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:26:40,029][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:26:40,748][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:26:41,468][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:26:42,186][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:26:42,905][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:26:43,624][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:26:44,343][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:26:45,061][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:26:45,780][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:26:46,498][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:26:47,219][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:26:47,937][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:26:48,656][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:26:49,376][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:26:50,094][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:26:50,814][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:26:51,534][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:26:52,252][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:26:52,974][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:26:53,692][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:26:54,412][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:26:55,133][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:26:55,851][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:26:56,570][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:26:57,290][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:26:58,009][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:26:58,730][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:26:59,449][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:27:00,168][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:27:01,160][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:27:01,881][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:27:02,601][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:27:03,321][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:27:04,041][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:27:04,759][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:27:05,480][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:27:06,200][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:27:06,920][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:27:07,638][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:27:08,358][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:27:09,079][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:27:09,799][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:27:10,519][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:27:11,240][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:27:11,961][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:27:12,682][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:27:13,416][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 22:27:14,435][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:27:14,438][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:27:14,439][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:27:15,894][__main__][INFO] - Iteration 503 took 55s (8.93% Gen, 88.45% Train). Generation: 4s, Training: 49s. Estimated remaining time: 7h 25m 29s. Estimated total time: 15h 28m 10s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 49s, 500 more iterations: 7h 44m 5s. [2026-03-25 22:27:15,898][__main__][INFO] - Starting iteration 503. [2026-03-25 22:27:15,903][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:27:15,904][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:27:20,950][__main__][INFO] - Number of regex retries in iteration 503: 0 [2026-03-25 22:27:20,951][__main__][INFO] - agents played in iteration 503 are Bob, Alice [2026-03-25 22:27:21,477][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:27:21,545][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:27:21,546][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:27:21,547][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:27:22,245][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:27:22,893][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:27:23,614][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:27:24,329][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:27:25,047][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:27:25,765][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:27:26,481][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:27:27,199][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:27:27,916][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:27:28,635][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:27:29,353][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:27:30,071][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:27:30,790][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:27:31,508][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:27:32,227][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:27:32,946][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:27:33,662][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:27:34,382][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:27:35,100][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:27:35,819][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:27:36,536][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:27:37,255][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:27:37,975][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:27:38,694][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:27:39,414][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:27:40,133][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:27:40,851][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:27:41,570][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:27:42,291][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:27:43,008][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:27:43,727][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:27:44,447][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:27:45,164][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:27:45,884][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:27:46,602][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:27:47,321][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:27:48,043][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:27:48,761][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:27:49,481][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:27:50,201][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:27:50,919][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:27:51,640][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:27:52,360][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:27:53,081][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:27:53,800][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:27:54,520][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:27:55,239][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:27:55,958][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:27:56,908][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:27:57,628][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:27:58,348][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:27:59,068][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:27:59,787][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:28:00,508][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:28:01,228][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:28:01,947][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:28:02,669][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:28:03,387][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:28:04,107][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:28:04,827][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:28:05,546][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:28:06,267][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:28:06,988][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:28:07,707][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:28:08,428][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:28:09,167][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 22:28:10,145][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:28:10,147][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:28:10,148][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:28:11,519][__main__][INFO] - Iteration 504 took 55s (9.07% Gen, 88.46% Train). Generation: 5s, Training: 49s. Estimated remaining time: 7h 23m 20s. Estimated total time: 15h 26m 57s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 41s, 500 more iterations: 7h 43m 28s. [2026-03-25 22:28:11,522][__main__][INFO] - Starting iteration 504. [2026-03-25 22:28:11,526][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:28:11,527][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:28:16,441][__main__][INFO] - Number of regex retries in iteration 504: 0 [2026-03-25 22:28:16,442][__main__][INFO] - agents played in iteration 504 are Bob, Alice [2026-03-25 22:28:17,035][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:28:17,101][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:28:17,102][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:28:17,103][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:28:17,788][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:28:18,436][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:28:19,153][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:28:19,872][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:28:20,588][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:28:21,307][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:28:22,024][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:28:22,742][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:28:23,460][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:28:24,178][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:28:24,895][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:28:25,613][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:28:26,331][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:28:27,050][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:28:27,768][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:28:28,486][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:28:29,204][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:28:29,923][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:28:30,641][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:28:31,360][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:28:32,080][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:28:32,797][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:28:33,517][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:28:34,237][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:28:34,953][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:28:35,673][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:28:36,393][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:28:37,110][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:28:37,830][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:28:38,549][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:28:39,268][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:28:39,988][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:28:40,707][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:28:41,426][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:28:42,147][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:28:42,865][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:28:43,584][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:28:44,306][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:28:45,025][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:28:45,744][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:28:46,465][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:28:47,183][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:28:47,903][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:28:48,624][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:28:49,343][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:28:50,063][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:28:50,783][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:28:51,503][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:28:52,461][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:28:53,183][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:28:53,904][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:28:54,623][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:28:55,342][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:28:56,063][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:28:56,783][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:28:57,503][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:28:58,223][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:28:58,946][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:28:59,666][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:29:00,386][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:29:01,107][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:29:01,826][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:29:02,548][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:29:03,269][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:29:03,988][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:29:04,717][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 22:29:05,670][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:29:05,673][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:29:05,674][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:29:07,066][__main__][INFO] - Iteration 505 took 55s (8.85% Gen, 88.64% Train). Generation: 4s, Training: 49s. Estimated remaining time: 7h 21m 9s. Estimated total time: 15h 25m 42s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 34s, 500 more iterations: 7h 42m 51s. [2026-03-25 22:29:07,069][__main__][INFO] - Starting iteration 505. [2026-03-25 22:29:07,075][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:29:07,075][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:29:15,757][__main__][INFO] - Number of regex retries in iteration 505: 0 [2026-03-25 22:29:15,758][__main__][INFO] - agents played in iteration 505 are Bob, Alice [2026-03-25 22:29:16,273][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:29:16,338][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:29:16,338][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:29:16,339][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:29:17,023][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:29:17,670][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:29:18,390][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:29:19,106][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:29:19,822][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:29:20,539][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:29:21,255][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:29:21,972][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:29:22,688][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:29:23,405][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:29:24,123][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:29:24,839][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:29:25,557][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:29:26,273][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:29:26,991][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:29:27,708][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:29:28,425][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:29:29,143][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:29:29,863][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:29:30,581][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:29:31,297][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:29:32,016][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:29:32,733][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:29:33,450][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:29:34,168][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:29:34,885][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:29:35,604][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:29:36,321][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:29:37,040][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:29:37,757][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:29:38,475][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:29:39,193][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:29:39,912][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:29:40,632][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:29:41,349][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:29:42,067][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:29:42,787][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:29:43,504][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:29:44,223][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:29:44,943][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:29:45,662][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:29:46,380][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:29:47,100][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:29:47,817][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:29:48,537][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:29:49,255][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:29:49,974][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:29:50,694][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:29:51,708][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:29:52,428][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:29:53,146][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:29:53,866][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:29:54,586][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:29:55,306][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:29:56,025][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:29:56,745][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:29:57,464][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:29:58,183][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:29:58,903][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:29:59,621][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:30:00,341][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:30:01,061][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:30:01,780][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:30:02,500][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:30:03,220][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:30:03,974][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 22:30:04,890][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:30:04,893][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:30:04,894][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:30:06,219][__main__][INFO] - Iteration 506 took 59s (14.68% Gen, 83.08% Train). Generation: 8s, Training: 49s. Estimated remaining time: 8h 20m 13s. Estimated total time: 16h 25m 45s. Time estimates for 10 more iterations: 9m 51s, 100 more iterations: 1h 38m 34s, 500 more iterations: 8h 12m 52s. [2026-03-25 22:30:06,222][__main__][INFO] - Starting iteration 506. [2026-03-25 22:30:06,226][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:30:06,227][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:30:11,195][__main__][INFO] - Number of regex retries in iteration 506: 0 [2026-03-25 22:30:11,196][__main__][INFO] - agents played in iteration 506 are Bob, Alice [2026-03-25 22:30:11,980][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:30:12,044][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:30:12,045][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:30:12,046][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:30:12,737][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:30:13,386][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:30:14,106][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:30:14,822][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:30:15,541][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:30:16,257][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:30:16,976][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:30:17,691][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:30:18,410][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:30:19,127][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:30:19,845][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:30:20,562][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:30:21,280][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:30:21,998][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:30:22,715][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:30:23,433][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:30:24,150][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:30:24,870][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:30:25,587][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:30:26,306][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:30:27,024][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:30:27,745][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:30:28,462][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:30:29,182][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:30:29,900][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:30:30,620][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:30:31,338][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:30:32,057][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:30:32,774][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:30:33,494][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:30:34,213][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:30:34,932][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:30:35,652][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:30:36,371][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:30:37,090][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:30:37,810][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:30:38,529][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:30:39,250][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:30:39,970][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:30:40,689][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:30:41,408][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:30:42,127][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:30:42,847][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:30:43,569][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:30:44,290][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:30:45,010][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:30:45,730][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:30:46,451][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:30:47,416][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:30:48,137][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:30:48,857][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:30:49,576][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:30:50,298][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:30:51,017][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:30:51,739][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:30:52,462][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:30:53,184][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:30:53,904][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:30:54,623][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:30:55,345][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:30:56,069][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:30:56,788][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:30:57,510][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:30:58,230][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:30:58,952][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:30:59,694][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 22:31:00,584][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:31:00,587][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:31:00,588][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:31:01,982][__main__][INFO] - Iteration 507 took 55s (8.91% Gen, 88.58% Train). Generation: 4s, Training: 49s. Estimated remaining time: 7h 22m 49s. Estimated total time: 15h 29m 17s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 55s, 500 more iterations: 7h 44m 38s. [2026-03-25 22:31:01,984][__main__][INFO] - Starting iteration 507. [2026-03-25 22:31:01,989][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:31:01,990][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:31:06,914][__main__][INFO] - Number of regex retries in iteration 507: 0 [2026-03-25 22:31:06,916][__main__][INFO] - agents played in iteration 507 are Bob, Alice [2026-03-25 22:31:07,412][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:31:07,477][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:31:07,477][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:31:07,478][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:31:08,162][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:31:08,810][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:31:09,530][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:31:10,253][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:31:10,975][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:31:11,693][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:31:12,415][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:31:13,137][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:31:13,858][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:31:14,579][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:31:15,300][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:31:16,018][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:31:16,736][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:31:17,455][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:31:18,173][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:31:18,891][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:31:19,611][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:31:20,329][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:31:21,047][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:31:21,765][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:31:22,483][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:31:23,201][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:31:23,923][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:31:24,645][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:31:25,368][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:31:26,089][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:31:26,810][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:31:27,532][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:31:28,252][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:31:28,973][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:31:29,695][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:31:30,415][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:31:31,136][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:31:31,859][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:31:32,578][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:31:33,300][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:31:34,023][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:31:34,746][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:31:35,465][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:31:36,186][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:31:36,906][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:31:37,627][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:31:38,350][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:31:39,070][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:31:39,792][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:31:40,515][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:31:41,237][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:31:41,958][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:31:42,921][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:31:43,645][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:31:44,367][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:31:45,088][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:31:45,809][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:31:46,531][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:31:47,252][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:31:47,972][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:31:48,693][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:31:49,414][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:31:50,135][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:31:50,856][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:31:51,578][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:31:52,300][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:31:53,021][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:31:53,743][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:31:54,463][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:31:55,205][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 22:31:56,219][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:31:56,382][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:31:56,396][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:31:57,910][__main__][INFO] - Iteration 508 took 55s (8.81% Gen, 88.48% Train). Generation: 4s, Training: 49s. Estimated remaining time: 7h 24m 40s. Estimated total time: 15h 32m 3s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 12s, 500 more iterations: 7h 46m 1s. [2026-03-25 22:31:57,913][__main__][INFO] - Starting iteration 508. [2026-03-25 22:31:57,919][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:31:57,920][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:32:02,989][__main__][INFO] - Number of regex retries in iteration 508: 0 [2026-03-25 22:32:02,990][__main__][INFO] - agents played in iteration 508 are Bob, Alice [2026-03-25 22:32:03,510][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:32:03,576][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:32:03,577][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:32:03,578][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:32:04,289][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:32:04,937][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:32:05,658][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:32:06,376][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:32:07,094][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:32:07,814][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:32:08,533][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:32:09,251][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:32:09,970][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:32:10,687][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:32:11,404][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:32:12,121][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:32:12,839][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:32:13,557][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:32:14,275][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:32:14,993][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:32:15,712][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:32:16,430][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:32:17,150][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:32:17,866][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:32:18,585][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:32:19,302][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:32:20,020][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:32:20,740][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:32:21,457][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:32:22,177][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:32:22,896][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:32:23,612][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:32:24,331][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:32:25,050][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:32:25,768][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:32:26,489][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:32:27,206][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:32:27,925][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:32:28,646][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:32:29,364][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:32:30,084][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:32:30,803][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:32:31,522][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:32:32,243][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:32:32,962][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:32:33,681][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:32:34,400][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:32:35,119][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:32:35,839][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:32:36,559][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:32:37,278][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:32:37,996][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:32:39,004][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:32:39,725][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:32:40,445][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:32:41,163][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:32:41,883][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:32:42,602][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:32:43,321][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:32:44,041][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:32:44,759][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:32:45,481][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:32:46,200][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:32:46,922][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:32:47,641][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:32:48,360][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:32:49,081][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:32:49,801][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:32:50,520][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:32:51,263][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 22:32:52,247][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:32:52,250][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:32:52,251][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:32:53,579][__main__][INFO] - Iteration 509 took 55s (9.11% Gen, 88.50% Train). Generation: 5s, Training: 49s. Estimated remaining time: 7h 19m 24s. Estimated total time: 15h 27m 43s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 46s, 500 more iterations: 7h 43m 51s. [2026-03-25 22:32:53,581][__main__][INFO] - Starting iteration 509. [2026-03-25 22:32:53,585][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:32:53,585][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:32:58,508][__main__][INFO] - Number of regex retries in iteration 509: 0 [2026-03-25 22:32:58,509][__main__][INFO] - agents played in iteration 509 are Bob, Alice [2026-03-25 22:32:59,009][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:32:59,076][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:32:59,077][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:32:59,078][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:32:59,766][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:33:00,417][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:33:01,137][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:33:01,856][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:33:02,571][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:33:03,289][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:33:04,007][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:33:04,724][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:33:05,443][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:33:06,160][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:33:06,880][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:33:07,598][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:33:08,316][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:33:09,035][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:33:09,754][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:33:10,471][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:33:11,191][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:33:11,908][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:33:12,627][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:33:13,343][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:33:14,062][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:33:14,781][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:33:15,500][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:33:16,217][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:33:16,935][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:33:17,654][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:33:18,373][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:33:19,092][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:33:19,809][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:33:20,529][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:33:21,247][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:33:21,966][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:33:22,684][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:33:23,402][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:33:24,122][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:33:24,840][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:33:25,559][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:33:26,279][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:33:26,997][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:33:27,717][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:33:28,438][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:33:29,157][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:33:29,876][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:33:30,597][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:33:31,315][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:33:32,035][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:33:32,757][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:33:33,475][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:33:34,455][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:33:35,176][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:33:35,894][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:33:36,615][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:33:37,335][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:33:38,054][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:33:38,775][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:33:39,496][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:33:40,216][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:33:40,936][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:33:41,658][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:33:42,378][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:33:43,096][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:33:43,819][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:33:44,539][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:33:45,258][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:33:45,980][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:33:46,731][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 22:33:47,642][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:33:47,645][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:33:47,646][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:33:49,203][__main__][INFO] - Iteration 510 took 55s (8.85% Gen, 88.35% Train). Generation: 4s, Training: 49s. Estimated remaining time: 7h 17m 45s. Estimated total time: 15h 27m 0s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 42s, 500 more iterations: 7h 43m 30s. [2026-03-25 22:33:49,206][__main__][INFO] - Starting iteration 510. [2026-03-25 22:33:49,210][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:33:49,211][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:33:54,096][__main__][INFO] - Number of regex retries in iteration 510: 0 [2026-03-25 22:33:54,097][__main__][INFO] - agents played in iteration 510 are Bob, Alice [2026-03-25 22:33:54,614][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:33:54,680][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:33:54,682][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:33:54,682][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:33:55,380][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:33:56,028][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:33:56,747][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:33:57,469][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:33:58,187][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:33:58,906][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:33:59,624][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:34:00,341][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:34:01,060][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:34:01,777][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:34:02,495][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:34:03,212][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:34:03,930][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:34:04,648][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:34:05,364][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:34:06,083][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:34:06,800][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:34:07,519][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:34:08,238][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:34:08,957][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:34:09,677][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:34:10,394][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:34:11,113][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:34:11,831][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:34:12,549][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:34:13,268][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:34:13,987][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:34:14,705][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:34:15,425][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:34:16,144][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:34:16,863][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:34:17,583][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:34:18,302][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:34:19,020][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:34:19,742][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:34:20,460][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:34:21,179][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:34:21,899][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:34:22,616][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:34:23,337][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:34:24,057][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:34:24,775][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:34:25,495][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:34:26,216][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:34:26,934][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:34:27,653][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:34:28,374][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:34:29,094][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:34:30,046][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:34:30,767][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:34:31,486][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:34:32,206][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:34:32,925][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:34:33,646][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:34:34,364][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:34:35,084][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:34:35,805][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:34:36,525][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:34:37,243][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:34:37,965][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:34:38,686][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:34:39,406][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:34:40,127][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:34:40,847][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:34:41,567][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:34:42,297][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 22:34:43,622][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:34:43,626][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:34:43,629][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:34:45,003][__main__][INFO] - Iteration 511 took 55s (8.76% Gen, 88.77% Train). Generation: 4s, Training: 49s. Estimated remaining time: 7h 19m 44s. Estimated total time: 15h 29m 55s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 59s, 500 more iterations: 7h 44m 57s. [2026-03-25 22:34:45,006][__main__][INFO] - Starting iteration 511. [2026-03-25 22:34:45,011][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:34:45,011][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:34:50,177][__main__][INFO] - Number of regex retries in iteration 511: 0 [2026-03-25 22:34:50,178][__main__][INFO] - agents played in iteration 511 are Bob, Alice [2026-03-25 22:34:51,098][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:34:51,164][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:34:51,165][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:34:51,166][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:34:51,853][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:34:52,502][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:34:53,221][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:34:53,941][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:34:54,659][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:34:55,377][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:34:56,094][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:34:56,811][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:34:57,530][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:34:58,248][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:34:58,966][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:34:59,683][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:35:00,401][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:35:01,119][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:35:01,838][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:35:02,557][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:35:03,275][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:35:03,994][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:35:04,712][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:35:05,432][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:35:06,150][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:35:06,868][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:35:07,587][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:35:08,304][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:35:09,025][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:35:09,744][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:35:10,462][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:35:11,182][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:35:11,900][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:35:12,619][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:35:13,338][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:35:14,057][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:35:14,777][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:35:15,495][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:35:16,213][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:35:16,934][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:35:17,653][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:35:18,374][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:35:19,093][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:35:19,814][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:35:20,532][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:35:21,252][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:35:21,973][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:35:22,691][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:35:23,411][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:35:24,131][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:35:24,851][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:35:25,570][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:35:26,581][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:35:27,303][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:35:28,022][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:35:28,741][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:35:29,461][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:35:30,187][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:35:30,906][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:35:31,626][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:35:32,347][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:35:33,066][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:35:33,786][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:35:34,506][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:35:35,226][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:35:35,947][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:35:36,667][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:35:37,388][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:35:38,108][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:35:38,849][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 22:35:39,849][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:35:39,851][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:35:39,853][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:35:41,178][__main__][INFO] - Iteration 512 took 56s (9.20% Gen, 88.44% Train). Generation: 5s, Training: 49s. Estimated remaining time: 7h 25m 2s. Estimated total time: 15h 36m 9s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 36s, 500 more iterations: 7h 48m 4s. [2026-03-25 22:35:41,181][__main__][INFO] - Starting iteration 512. [2026-03-25 22:35:41,185][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:35:41,186][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:35:46,140][__main__][INFO] - Number of regex retries in iteration 512: 0 [2026-03-25 22:35:46,142][__main__][INFO] - agents played in iteration 512 are Bob, Alice [2026-03-25 22:35:46,681][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:35:46,747][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:35:46,748][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:35:46,748][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:35:47,470][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:35:48,118][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:35:48,840][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:35:49,558][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:35:50,276][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:35:50,994][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:35:51,715][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:35:52,434][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:35:53,155][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:35:53,873][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:35:54,590][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:35:55,310][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:35:56,030][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:35:56,751][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:35:57,471][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:35:58,190][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:35:58,910][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:35:59,629][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:36:00,349][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:36:01,069][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:36:01,788][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:36:02,508][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:36:03,229][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:36:03,950][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:36:04,670][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:36:05,389][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:36:06,111][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:36:06,831][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:36:07,550][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:36:08,270][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:36:08,990][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:36:09,711][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:36:10,434][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:36:11,154][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:36:11,873][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:36:12,594][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:36:13,317][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:36:14,038][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:36:14,759][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:36:15,481][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:36:16,201][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:36:16,921][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:36:17,640][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:36:18,361][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:36:19,081][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:36:19,800][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:36:20,520][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:36:21,239][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:36:22,198][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:36:22,920][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:36:23,639][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:36:24,358][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:36:25,078][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:36:25,799][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:36:26,519][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:36:27,242][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:36:27,962][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:36:28,686][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:36:29,406][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:36:30,126][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:36:30,847][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:36:31,566][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:36:32,285][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:36:33,007][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:36:33,727][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:36:34,464][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 22:36:35,405][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:36:35,408][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:36:35,409][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:36:36,702][__main__][INFO] - Iteration 513 took 55s (8.93% Gen, 88.74% Train). Generation: 4s, Training: 49s. Estimated remaining time: 7h 13m 16s. Estimated total time: 15h 25m 19s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 31s, 500 more iterations: 7h 42m 39s. [2026-03-25 22:36:36,706][__main__][INFO] - Starting iteration 513. [2026-03-25 22:36:36,711][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:36:36,712][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:36:41,815][__main__][INFO] - Number of regex retries in iteration 513: 0 [2026-03-25 22:36:41,816][__main__][INFO] - agents played in iteration 513 are Bob, Alice [2026-03-25 22:36:42,316][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:36:42,381][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:36:42,382][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:36:42,382][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:36:43,070][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:36:43,719][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:36:44,438][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:36:45,157][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:36:45,876][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:36:46,592][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:36:47,311][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:36:48,028][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:36:48,746][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:36:49,463][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:36:50,181][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:36:50,900][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:36:51,618][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:36:52,336][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:36:53,055][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:36:53,773][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:36:54,492][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:36:55,210][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:36:55,930][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:36:56,647][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:36:57,366][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:36:58,086][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:36:58,804][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:36:59,524][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:37:00,241][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:37:00,961][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:37:01,680][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:37:02,398][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:37:03,119][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:37:03,838][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:37:04,558][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:37:05,278][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:37:05,997][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:37:06,716][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:37:07,436][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:37:08,155][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:37:08,876][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:37:09,595][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:37:10,315][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:37:11,034][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:37:11,753][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:37:12,475][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:37:13,194][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:37:13,914][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:37:14,634][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:37:15,353][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:37:16,074][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:37:16,794][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:37:17,758][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:37:18,479][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:37:19,197][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:37:19,917][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:37:20,637][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:37:21,356][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:37:22,077][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:37:22,798][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:37:23,518][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:37:24,237][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:37:24,958][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:37:25,678][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:37:26,398][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:37:27,122][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:37:27,843][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:37:28,563][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:37:29,284][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:37:30,027][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 22:37:31,060][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:37:31,063][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:37:31,065][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:37:32,482][__main__][INFO] - Iteration 514 took 55s (9.15% Gen, 88.30% Train). Generation: 5s, Training: 49s. Estimated remaining time: 7h 16m 33s. Estimated total time: 15h 29m 31s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 57s, 500 more iterations: 7h 44m 45s. [2026-03-25 22:37:32,485][__main__][INFO] - Starting iteration 514. [2026-03-25 22:37:32,490][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:37:32,491][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:37:37,404][__main__][INFO] - Number of regex retries in iteration 514: 0 [2026-03-25 22:37:37,405][__main__][INFO] - agents played in iteration 514 are Bob, Alice [2026-03-25 22:37:37,909][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:37:37,974][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:37:37,975][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:37:37,976][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:37:38,678][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:37:39,326][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:37:40,046][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:37:40,763][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:37:41,481][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:37:42,197][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:37:42,914][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:37:43,631][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:37:44,349][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:37:45,065][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:37:45,785][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:37:46,505][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:37:47,224][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:37:47,943][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:37:48,662][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:37:49,380][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:37:50,100][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:37:50,817][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:37:51,535][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:37:52,254][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:37:52,973][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:37:53,692][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:37:54,410][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:37:55,128][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:37:55,849][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:37:56,567][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:37:57,287][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:37:58,006][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:37:58,725][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:37:59,444][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:38:00,163][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:38:00,882][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:38:01,602][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:38:02,321][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:38:03,039][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:38:03,760][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:38:04,479][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:38:05,199][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:38:05,919][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:38:06,639][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:38:07,358][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:38:08,081][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:38:08,800][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:38:09,521][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:38:10,240][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:38:10,962][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:38:11,682][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:38:12,402][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:38:13,415][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:38:14,136][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:38:14,855][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:38:15,579][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:38:16,301][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:38:17,019][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:38:17,739][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:38:18,459][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:38:19,179][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:38:19,899][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:38:20,619][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:38:21,339][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:38:22,059][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:38:22,778][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:38:23,500][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:38:24,220][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:38:24,940][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:38:25,693][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 22:38:26,939][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:38:26,943][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:38:26,946][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:38:28,304][__main__][INFO] - Iteration 515 took 55s (8.80% Gen, 88.76% Train). Generation: 4s, Training: 49s. Estimated remaining time: 7h 16m 22s. Estimated total time: 15h 30m 16s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 1s, 500 more iterations: 7h 45m 8s. [2026-03-25 22:38:28,308][__main__][INFO] - Starting iteration 515. [2026-03-25 22:38:28,312][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:38:28,313][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:38:33,221][__main__][INFO] - Number of regex retries in iteration 515: 0 [2026-03-25 22:38:33,222][__main__][INFO] - agents played in iteration 515 are Bob, Alice [2026-03-25 22:38:33,731][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:38:33,799][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:38:33,800][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:38:33,800][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:38:34,523][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:38:35,171][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:38:35,892][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:38:36,608][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:38:37,326][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:38:38,044][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:38:38,763][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:38:39,480][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:38:40,198][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:38:40,916][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:38:41,635][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:38:42,352][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:38:43,070][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:38:43,788][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:38:44,505][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:38:45,224][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:38:45,943][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:38:46,661][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:38:47,379][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:38:48,098][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:38:48,818][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:38:49,535][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:38:50,255][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:38:50,973][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:38:51,691][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:38:52,410][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:38:53,129][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:38:53,848][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:38:54,567][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:38:55,285][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:38:56,006][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:38:56,724][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:38:57,444][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:38:58,164][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:38:58,885][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:38:59,604][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:39:00,322][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:39:01,042][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:39:01,761][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:39:02,480][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:39:03,200][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:39:03,919][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:39:04,639][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:39:05,359][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:39:06,078][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:39:06,797][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:39:07,517][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:39:08,237][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:39:09,198][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:39:09,919][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:39:10,636][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:39:11,356][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:39:12,077][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:39:12,797][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:39:13,517][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:39:14,238][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:39:14,957][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:39:15,677][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:39:16,399][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:39:17,118][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:39:17,839][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:39:18,561][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:39:19,281][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:39:20,001][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:39:20,724][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:39:21,472][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 22:39:22,547][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:39:22,550][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:39:22,551][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:39:23,957][__main__][INFO] - Iteration 516 took 55s (8.82% Gen, 88.65% Train). Generation: 4s, Training: 49s. Estimated remaining time: 7h 12m 37s. Estimated total time: 15h 27m 27s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 44s, 500 more iterations: 7h 43m 43s. [2026-03-25 22:39:23,960][__main__][INFO] - Starting iteration 516. [2026-03-25 22:39:23,965][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:39:23,966][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:39:28,947][__main__][INFO] - Number of regex retries in iteration 516: 0 [2026-03-25 22:39:28,948][__main__][INFO] - agents played in iteration 516 are Bob, Alice [2026-03-25 22:39:29,458][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:39:29,523][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:39:29,524][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:39:29,524][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:39:30,239][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:39:30,889][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:39:31,609][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:39:32,328][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:39:33,047][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:39:33,766][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:39:34,486][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:39:35,203][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:39:35,922][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:39:36,641][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:39:37,362][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:39:38,081][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:39:38,801][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:39:39,521][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:39:40,242][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:39:40,959][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:39:41,679][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:39:42,399][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:39:43,119][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:39:43,839][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:39:44,558][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:39:45,277][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:39:45,997][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:39:46,716][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:39:47,436][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:39:48,157][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:39:48,878][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:39:49,600][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:39:50,318][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:39:51,038][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:39:51,759][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:39:52,479][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:39:53,198][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:39:53,918][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:39:54,638][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:39:55,360][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:39:56,079][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:39:56,800][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:39:57,520][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:39:58,240][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:39:58,961][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:39:59,681][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:40:00,402][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:40:01,124][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:40:01,846][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:40:02,565][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:40:03,288][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:40:04,007][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:40:04,968][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:40:05,688][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:40:06,409][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:40:07,132][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:40:07,855][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:40:08,575][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:40:09,297][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:40:10,018][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:40:10,739][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:40:11,458][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:40:12,178][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:40:12,900][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:40:13,620][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:40:14,340][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:40:15,061][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:40:15,782][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:40:16,502][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:40:17,255][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 22:40:18,336][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:40:18,339][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:40:18,341][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:40:19,783][__main__][INFO] - Iteration 517 took 55s (8.93% Gen, 88.48% Train). Generation: 4s, Training: 49s. Estimated remaining time: 7h 14m 35s. Estimated total time: 15h 30m 20s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 2s, 500 more iterations: 7h 45m 10s. [2026-03-25 22:40:19,785][__main__][INFO] - Starting iteration 517. [2026-03-25 22:40:19,790][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:40:19,790][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:40:24,686][__main__][INFO] - Number of regex retries in iteration 517: 0 [2026-03-25 22:40:24,687][__main__][INFO] - agents played in iteration 517 are Bob, Alice [2026-03-25 22:40:25,213][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:40:25,279][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:40:25,280][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:40:25,281][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:40:25,982][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:40:26,630][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:40:27,352][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:40:28,070][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:40:28,790][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:40:29,510][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:40:30,229][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:40:30,947][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:40:31,666][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:40:32,385][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:40:33,102][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:40:33,821][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:40:34,543][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:40:35,261][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:40:35,981][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:40:36,699][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:40:37,419][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:40:38,137][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:40:38,858][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:40:39,578][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:40:40,299][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:40:41,019][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:40:41,739][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:40:42,460][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:40:43,179][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:40:43,899][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:40:44,618][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:40:45,341][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:40:46,061][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:40:46,781][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:40:47,501][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:40:48,220][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:40:48,940][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:40:49,659][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:40:50,381][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:40:51,103][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:40:51,822][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:40:52,542][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:40:53,263][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:40:53,984][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:40:54,704][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:40:55,424][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:40:56,146][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:40:56,866][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:40:57,586][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:40:58,308][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:40:59,028][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:40:59,748][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:41:00,760][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:41:01,483][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:41:02,203][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:41:02,924][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:41:03,645][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:41:04,365][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:41:05,085][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:41:05,805][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:41:06,524][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:41:07,244][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:41:07,965][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:41:08,685][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:41:09,406][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:41:10,127][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:41:10,847][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:41:11,566][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:41:12,287][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:41:13,042][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 22:41:14,408][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:41:14,413][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:41:14,415][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:41:15,909][__main__][INFO] - Iteration 518 took 56s (8.73% Gen, 88.61% Train). Generation: 4s, Training: 49s. Estimated remaining time: 7h 18m 40s. Estimated total time: 15h 35m 21s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 32s, 500 more iterations: 7h 47m 40s. [2026-03-25 22:41:15,914][__main__][INFO] - Starting iteration 518. [2026-03-25 22:41:15,918][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:41:15,919][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:41:21,688][__main__][INFO] - Number of regex retries in iteration 518: 0 [2026-03-25 22:41:21,689][__main__][INFO] - agents played in iteration 518 are Bob, Alice [2026-03-25 22:41:22,297][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:41:22,363][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:41:22,363][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:41:22,364][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:41:23,066][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:41:23,714][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:41:24,434][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:41:25,151][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:41:25,868][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:41:26,587][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:41:27,305][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:41:28,023][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:41:28,741][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:41:29,459][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:41:30,177][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:41:30,898][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:41:31,617][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:41:32,336][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:41:33,054][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:41:33,774][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:41:34,491][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:41:35,210][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:41:35,928][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:41:36,646][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:41:37,365][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:41:38,082][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:41:38,803][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:41:39,522][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:41:40,242][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:41:40,958][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:41:41,678][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:41:42,396][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:41:43,115][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:41:43,834][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:41:44,553][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:41:45,271][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:41:45,990][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:41:46,709][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:41:47,427][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:41:48,147][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:41:48,865][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:41:49,585][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:41:50,304][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:41:51,022][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:41:51,742][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:41:52,462][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:41:53,181][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:41:53,901][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:41:54,621][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:41:55,339][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:41:56,059][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:41:56,778][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:41:57,745][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:41:58,466][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:41:59,185][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:41:59,907][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:42:00,628][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:42:01,347][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:42:02,068][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:42:02,788][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:42:03,507][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:42:04,227][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:42:04,948][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:42:05,667][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:42:06,387][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:42:07,108][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:42:07,828][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:42:08,547][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:42:09,269][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:42:10,000][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 22:42:11,199][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:42:11,203][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:42:11,205][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:42:12,633][__main__][INFO] - Iteration 519 took 56s (10.17% Gen, 87.30% Train). Generation: 5s, Training: 49s. Estimated remaining time: 7h 27m 39s. Estimated total time: 15h 45m 17s. Time estimates for 10 more iterations: 9m 27s, 100 more iterations: 1h 34m 31s, 500 more iterations: 7h 52m 38s. [2026-03-25 22:42:12,636][__main__][INFO] - Starting iteration 519. [2026-03-25 22:42:12,640][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:42:12,641][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:42:17,534][__main__][INFO] - Number of regex retries in iteration 519: 0 [2026-03-25 22:42:17,536][__main__][INFO] - agents played in iteration 519 are Bob, Alice [2026-03-25 22:42:18,041][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:42:18,105][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:42:18,106][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:42:18,106][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:42:18,797][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:42:19,445][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:42:20,165][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:42:20,881][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:42:21,604][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:42:22,326][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:42:23,047][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:42:23,766][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:42:24,483][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:42:25,201][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:42:25,919][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:42:26,637][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:42:27,354][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:42:28,074][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:42:28,792][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:42:29,511][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:42:30,232][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:42:30,954][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:42:31,674][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:42:32,395][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:42:33,115][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:42:33,834][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:42:34,555][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:42:35,273][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:42:35,994][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:42:36,715][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:42:37,434][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:42:38,155][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:42:38,876][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:42:39,596][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:42:40,316][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:42:41,038][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:42:41,757][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:42:42,477][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:42:43,199][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:42:43,919][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:42:44,640][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:42:45,362][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:42:46,083][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:42:46,804][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:42:47,526][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:42:48,247][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:42:48,968][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:42:49,688][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:42:50,410][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:42:51,131][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:42:51,853][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:42:52,574][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:42:53,546][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:42:54,268][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:42:54,990][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:42:55,712][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:42:56,431][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:42:57,155][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:42:57,876][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:42:58,597][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:42:59,319][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:43:00,041][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:43:00,763][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:43:01,485][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:43:02,205][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:43:02,927][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:43:03,648][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:43:04,371][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:43:05,092][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:43:05,833][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 22:43:06,952][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:43:06,956][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:43:06,957][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:43:08,364][__main__][INFO] - Iteration 520 took 55s (8.78% Gen, 88.69% Train). Generation: 4s, Training: 49s. Estimated remaining time: 7h 10m 11s. Estimated total time: 15h 28m 45s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 52s, 500 more iterations: 7h 44m 22s. [2026-03-25 22:43:08,367][__main__][INFO] - Starting iteration 520. [2026-03-25 22:43:08,371][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:43:08,372][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:43:13,427][__main__][INFO] - Number of regex retries in iteration 520: 0 [2026-03-25 22:43:13,428][__main__][INFO] - agents played in iteration 520 are Bob, Alice [2026-03-25 22:43:13,963][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:43:14,027][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:43:14,028][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:43:14,029][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:43:14,742][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:43:15,389][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:43:16,113][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:43:16,831][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:43:17,549][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:43:18,269][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:43:18,987][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:43:19,707][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:43:20,425][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:43:21,143][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:43:21,861][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:43:22,578][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:43:23,296][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:43:24,014][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:43:24,732][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:43:25,451][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:43:27,399][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:43:28,116][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:43:28,835][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:43:29,552][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:43:30,271][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:43:30,988][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:43:31,707][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:43:32,426][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:43:33,144][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:43:33,863][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:43:34,582][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:43:35,300][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:43:36,019][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:43:36,737][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:43:37,455][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:43:38,176][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:43:38,895][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:43:39,615][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:43:40,333][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:43:41,052][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:43:41,772][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:43:42,490][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:43:43,208][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:43:43,929][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:43:44,646][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:43:45,366][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:43:46,085][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:43:46,802][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:43:47,524][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:43:48,243][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:43:48,963][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:43:49,687][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:43:50,654][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:43:51,373][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:43:52,093][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:43:52,812][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:43:53,530][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:43:54,252][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:43:54,971][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:43:55,691][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:43:56,414][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:43:57,134][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:43:57,854][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:43:58,574][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:43:59,295][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:44:00,015][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:44:00,735][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:44:01,455][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:44:02,177][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:44:02,947][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:48 [2026-03-25 22:44:04,406][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:44:04,410][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:44:04,413][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:44:05,853][__main__][INFO] - Iteration 521 took 57s (8.80% Gen, 88.69% Train). Generation: 5s, Training: 50s. Estimated remaining time: 7h 38m 33s. Estimated total time: 15h 58m 4s. Time estimates for 10 more iterations: 9m 34s, 100 more iterations: 1h 35m 48s, 500 more iterations: 7h 59m 2s. [2026-03-25 22:44:05,857][__main__][INFO] - Starting iteration 521. [2026-03-25 22:44:05,863][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:44:05,863][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:44:10,931][__main__][INFO] - Number of regex retries in iteration 521: 0 [2026-03-25 22:44:10,932][__main__][INFO] - agents played in iteration 521 are Bob, Alice [2026-03-25 22:44:11,428][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:44:11,494][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:44:11,495][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:44:11,496][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:44:12,187][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:44:12,832][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:44:13,553][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:44:14,268][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:44:14,988][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:44:15,704][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:44:16,422][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:44:17,138][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:44:17,855][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:44:18,573][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:44:19,290][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:44:20,007][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:44:20,724][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:44:21,443][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:44:22,161][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:44:22,879][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:44:23,596][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:44:24,315][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:44:25,034][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:44:25,752][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:44:26,470][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:44:27,188][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:44:27,907][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:44:28,626][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:44:29,344][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:44:30,062][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:44:30,783][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:44:31,501][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:44:32,220][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:44:32,939][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:44:33,658][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:44:34,376][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:44:35,097][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:44:35,815][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:44:36,534][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:44:37,254][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:44:37,973][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:44:38,693][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:44:39,414][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:44:40,132][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:44:40,851][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:44:41,571][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:44:42,290][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:44:43,010][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:44:43,730][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:44:44,449][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:44:45,168][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:44:45,889][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:44:46,888][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:44:47,609][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:44:48,328][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:44:49,047][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:44:49,769][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:44:50,490][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:44:51,209][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:44:51,928][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:44:52,649][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:44:53,368][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:44:54,089][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:44:54,808][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:44:55,528][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:44:56,249][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:44:56,968][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:44:57,688][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:44:58,409][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:44:59,131][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 22:45:00,354][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:45:00,358][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:45:00,360][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:45:01,726][__main__][INFO] - Iteration 522 took 55s (9.07% Gen, 88.47% Train). Generation: 5s, Training: 49s. Estimated remaining time: 7h 10m 38s. Estimated total time: 15h 31m 5s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 6s, 500 more iterations: 7h 45m 32s. [2026-03-25 22:45:01,729][__main__][INFO] - Starting iteration 522. [2026-03-25 22:45:01,734][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:45:01,734][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:45:06,633][__main__][INFO] - Number of regex retries in iteration 522: 0 [2026-03-25 22:45:06,634][__main__][INFO] - agents played in iteration 522 are Bob, Alice [2026-03-25 22:45:07,131][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:45:07,196][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:45:07,197][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:45:07,198][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:45:07,887][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:45:08,535][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:45:09,254][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:45:09,972][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:45:10,689][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:45:11,406][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:45:12,124][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:45:12,842][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:45:13,560][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:45:14,279][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:45:14,997][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:45:15,715][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:45:16,434][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:45:17,152][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:45:17,870][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:45:18,589][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:45:19,307][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:45:20,027][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:45:20,745][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:45:21,464][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:45:22,184][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:45:22,902][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:45:23,620][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:45:24,340][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:45:25,058][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:45:25,777][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:45:26,495][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:45:27,214][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:45:27,935][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:45:28,654][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:45:29,372][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:45:30,092][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:45:30,812][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:45:31,530][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:45:32,251][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:45:32,970][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:45:33,689][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:45:34,410][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:45:35,130][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:45:35,847][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:45:36,568][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:45:37,288][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:45:38,007][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:45:38,730][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:45:39,451][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:45:40,171][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:45:40,892][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:45:41,612][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:45:42,562][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:45:43,285][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:45:44,005][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:45:44,725][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:45:45,444][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:45:46,165][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:45:46,884][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:45:47,605][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:45:48,326][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:45:49,048][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:45:49,769][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:45:50,488][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:45:51,210][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:45:51,929][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:45:52,649][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:45:53,371][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:45:54,090][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:45:54,809][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 22:45:55,959][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:45:55,962][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:45:55,964][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:45:57,404][__main__][INFO] - Iteration 523 took 55s (8.80% Gen, 88.61% Train). Generation: 4s, Training: 49s. Estimated remaining time: 7h 6m 29s. Estimated total time: 15h 27m 52s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 47s, 500 more iterations: 7h 43m 56s. [2026-03-25 22:45:57,407][__main__][INFO] - Starting iteration 523. [2026-03-25 22:45:57,411][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:45:57,412][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:46:02,939][__main__][INFO] - Number of regex retries in iteration 523: 0 [2026-03-25 22:46:02,940][__main__][INFO] - agents played in iteration 523 are Bob, Alice [2026-03-25 22:46:03,441][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:46:03,505][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:46:03,505][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:46:03,506][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:46:04,195][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:46:04,841][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:46:05,562][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:46:06,279][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:46:06,998][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:46:07,714][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:46:08,434][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:46:09,151][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:46:09,872][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:46:10,589][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:46:11,307][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:46:12,025][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:46:12,744][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:46:13,461][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:46:14,181][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:46:14,899][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:46:15,617][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:46:16,336][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:46:17,054][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:46:17,773][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:46:18,491][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:46:19,212][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:46:19,929][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:46:20,647][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:46:21,367][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:46:22,086][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:46:22,805][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:46:23,525][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:46:24,245][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:46:24,964][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:46:25,685][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:46:26,403][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:46:27,123][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:46:27,844][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:46:28,562][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:46:29,282][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:46:30,004][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:46:30,722][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:46:31,442][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:46:32,163][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:46:32,882][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:46:33,603][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:46:34,324][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:46:35,044][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:46:35,764][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:46:36,485][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:46:37,207][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:46:37,925][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:46:38,894][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:46:39,614][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:46:40,335][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:46:41,055][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:46:41,774][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:46:42,494][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:46:43,216][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:46:43,936][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:46:44,655][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:46:45,379][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:46:46,098][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:46:46,820][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:46:47,540][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:46:48,261][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:46:48,981][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:46:49,703][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:46:50,425][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:46:51,194][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 22:46:52,535][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:46:52,539][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:46:52,542][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:46:53,834][__main__][INFO] - Iteration 524 took 56s (9.80% Gen, 87.91% Train). Generation: 5s, Training: 49s. Estimated remaining time: 7h 18m 5s. Estimated total time: 15h 40m 25s. Time estimates for 10 more iterations: 9m 24s, 100 more iterations: 1h 34m 2s, 500 more iterations: 7h 50m 12s. [2026-03-25 22:46:53,837][__main__][INFO] - Starting iteration 524. [2026-03-25 22:46:53,840][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:46:53,841][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:46:58,780][__main__][INFO] - Number of regex retries in iteration 524: 0 [2026-03-25 22:46:58,781][__main__][INFO] - agents played in iteration 524 are Bob, Alice [2026-03-25 22:46:59,271][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:46:59,338][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:46:59,339][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:46:59,339][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:47:00,027][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:47:00,677][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:47:01,398][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:47:02,115][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:47:02,833][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:47:03,551][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:47:04,268][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:47:04,986][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:47:05,704][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:47:06,422][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:47:07,141][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:47:07,860][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:47:08,580][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:47:09,300][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:47:10,019][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:47:10,737][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:47:11,457][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:47:12,175][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:47:12,896][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:47:13,613][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:47:14,331][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:47:15,051][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:47:15,770][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:47:16,491][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:47:17,210][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:47:17,929][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:47:18,648][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:47:19,367][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:47:20,088][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:47:20,807][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:47:21,528][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:47:22,248][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:47:22,967][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:47:23,688][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:47:24,407][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:47:25,128][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:47:25,849][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:47:26,567][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:47:27,289][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:47:28,009][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:47:28,730][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:47:29,449][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:47:30,169][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:47:30,890][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:47:31,609][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:47:32,330][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:47:33,051][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:47:33,770][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:47:34,768][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:47:35,490][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:47:36,209][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:47:36,930][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:47:37,652][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:47:38,374][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:47:39,096][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:47:39,815][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:47:40,536][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:47:41,257][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:47:41,978][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:47:42,698][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:47:43,418][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:47:44,141][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:47:44,861][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:47:45,582][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:47:46,303][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:47:47,039][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 22:47:48,087][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:47:48,091][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:47:48,093][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:47:49,388][__main__][INFO] - Iteration 525 took 55s (8.89% Gen, 88.77% Train). Generation: 4s, Training: 49s. Estimated remaining time: 7h 2m 34s. Estimated total time: 15h 25m 49s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 34s, 500 more iterations: 7h 42m 54s. [2026-03-25 22:47:49,390][__main__][INFO] - Starting iteration 525. [2026-03-25 22:47:49,394][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:47:49,395][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:47:54,336][__main__][INFO] - Number of regex retries in iteration 525: 0 [2026-03-25 22:47:54,337][__main__][INFO] - agents played in iteration 525 are Bob, Alice [2026-03-25 22:47:54,918][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:47:54,983][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:47:54,984][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:47:55,027][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:47:55,717][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:47:56,366][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:47:57,085][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:47:57,802][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:47:58,520][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:47:59,237][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:47:59,956][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:48:00,673][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:48:01,393][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:48:02,111][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:48:02,829][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:48:03,548][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:48:04,266][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:48:04,985][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:48:05,702][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:48:06,421][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:48:07,140][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:48:07,858][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:48:08,578][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:48:09,298][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:48:10,016][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:48:10,736][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:48:11,454][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:48:12,173][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:48:12,894][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:48:13,612][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:48:14,331][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:48:15,050][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:48:15,768][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:48:16,489][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:48:17,207][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:48:17,927][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:48:18,650][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:48:19,367][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:48:20,087][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:48:20,806][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:48:21,526][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:48:22,248][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:48:22,967][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:48:23,687][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:48:24,408][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:48:25,127][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:48:25,848][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:48:26,567][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:48:27,286][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:48:28,007][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:48:28,726][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:48:29,447][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:48:30,396][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:48:31,118][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:48:31,837][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:48:32,557][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:48:33,278][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:48:33,998][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:48:34,718][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:48:35,439][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:48:36,159][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:48:36,879][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:48:37,601][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:48:38,320][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:48:39,042][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:48:39,763][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:48:40,484][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:48:41,204][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:48:41,925][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:48:42,646][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 22:48:43,726][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:48:43,729][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:48:43,731][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:48:45,231][__main__][INFO] - Iteration 526 took 55s (8.85% Gen, 88.46% Train). Generation: 4s, Training: 49s. Estimated remaining time: 7h 6m 27s. Estimated total time: 15h 30m 38s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 3s, 500 more iterations: 7h 45m 19s. [2026-03-25 22:48:45,234][__main__][INFO] - Starting iteration 526. [2026-03-25 22:48:45,237][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:48:45,238][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:48:53,742][__main__][INFO] - Number of regex retries in iteration 526: 0 [2026-03-25 22:48:53,743][__main__][INFO] - agents played in iteration 526 are Bob, Alice [2026-03-25 22:48:54,604][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:48:54,672][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:48:54,672][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:48:54,673][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:48:55,372][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:48:56,019][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:48:56,739][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:48:57,455][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:48:58,172][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:48:58,887][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:48:59,605][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:49:00,321][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:49:01,037][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:49:01,753][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:49:02,469][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:49:03,186][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:49:03,903][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:49:04,620][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:49:05,337][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:49:06,056][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:49:06,773][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:49:07,491][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:49:08,208][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:49:08,926][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:49:09,645][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:49:10,361][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:49:11,080][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:49:11,797][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:49:12,515][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:49:13,233][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:49:13,951][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:49:14,670][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:49:15,387][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:49:16,106][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:49:16,825][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:49:17,543][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:49:18,262][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:49:18,980][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:49:19,699][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:49:20,417][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:49:21,137][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:49:21,856][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:49:22,575][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:49:23,295][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:49:24,014][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:49:24,735][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:49:25,455][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:49:26,175][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:49:26,898][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:49:27,618][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:49:28,337][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:49:29,058][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:49:30,028][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:49:30,751][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:49:31,472][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:49:32,191][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:49:32,913][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:49:33,632][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:49:34,352][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:49:35,071][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:49:35,791][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:49:36,511][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:49:37,232][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:49:37,957][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:49:38,672][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:49:39,393][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:49:40,115][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:49:40,836][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:49:41,556][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:49:42,315][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 22:49:43,433][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:49:43,436][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:49:43,438][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:49:44,799][__main__][INFO] - Iteration 527 took 59s (14.28% Gen, 83.43% Train). Generation: 8s, Training: 49s. Estimated remaining time: 8h 7m 32s. Estimated total time: 16h 32m 43s. Time estimates for 10 more iterations: 9m 55s, 100 more iterations: 1h 39m 16s, 500 more iterations: 8h 16m 21s. [2026-03-25 22:49:44,801][__main__][INFO] - Starting iteration 527. [2026-03-25 22:49:44,805][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:49:44,806][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:49:49,714][__main__][INFO] - Number of regex retries in iteration 527: 0 [2026-03-25 22:49:49,715][__main__][INFO] - agents played in iteration 527 are Bob, Alice [2026-03-25 22:49:50,225][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:49:50,289][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:49:50,290][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:49:50,291][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:49:51,005][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:49:51,653][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:49:52,372][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:49:53,089][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:49:53,808][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:49:54,524][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:49:55,242][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:49:55,960][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:49:56,677][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:49:57,395][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:49:58,113][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:49:58,832][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:49:59,550][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:50:00,269][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:50:00,987][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:50:01,704][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:50:02,425][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:50:03,142][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:50:03,860][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:50:04,579][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:50:05,297][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:50:06,017][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:50:06,736][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:50:07,455][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:50:08,175][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:50:08,894][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:50:09,614][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:50:10,333][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:50:11,051][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:50:11,771][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:50:12,490][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:50:13,209][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:50:13,929][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:50:14,649][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:50:15,367][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:50:16,088][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:50:16,807][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:50:17,526][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:50:18,246][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:50:18,966][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:50:19,684][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:50:20,404][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:50:21,124][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:50:21,843][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:50:22,564][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:50:23,284][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:50:24,002][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:50:24,723][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:50:25,702][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:50:26,421][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:50:27,142][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:50:27,862][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:50:28,582][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:50:29,303][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:50:30,022][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:50:30,742][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:50:31,463][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:50:32,181][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:50:32,903][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:50:33,623][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:50:34,343][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:50:35,064][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:50:35,785][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:50:36,505][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:50:37,226][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:50:37,959][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 22:50:38,974][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:50:38,977][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:50:38,978][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:50:40,310][__main__][INFO] - Iteration 528 took 55s (8.84% Gen, 88.75% Train). Generation: 4s, Training: 49s. Estimated remaining time: 6h 59m 0s. Estimated total time: 15h 25m 6s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 30s, 500 more iterations: 7h 42m 33s. [2026-03-25 22:50:40,314][__main__][INFO] - Starting iteration 528. [2026-03-25 22:50:40,320][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:50:40,322][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:50:45,301][__main__][INFO] - Number of regex retries in iteration 528: 0 [2026-03-25 22:50:45,302][__main__][INFO] - agents played in iteration 528 are Bob, Alice [2026-03-25 22:50:45,808][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:50:45,874][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:50:45,875][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:50:45,875][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:50:46,578][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:50:47,226][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:50:47,945][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:50:48,663][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:50:49,382][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:50:50,101][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:50:50,819][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:50:51,538][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:50:52,257][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:50:52,975][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:50:53,693][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:50:54,411][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:50:55,129][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:50:55,847][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:50:56,566][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:50:57,284][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:50:58,005][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:50:58,723][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:50:59,443][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:51:00,162][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:51:00,880][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:51:01,599][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:51:02,319][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:51:03,038][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:51:03,761][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:51:04,482][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:51:05,200][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:51:05,919][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:51:06,639][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:51:07,359][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:51:08,078][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:51:08,799][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:51:09,519][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:51:10,238][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:51:10,959][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:51:11,679][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:51:12,400][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:51:13,118][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:51:13,840][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:51:14,558][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:51:15,278][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:51:16,000][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:51:16,719][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:51:17,439][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:51:18,160][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:51:18,881][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:51:19,601][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:51:20,320][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:51:21,274][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:51:21,995][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:51:22,716][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:51:23,439][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:51:24,158][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:51:24,879][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:51:25,600][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:51:26,320][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:51:27,040][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:51:27,762][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:51:28,484][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:51:29,204][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:51:29,925][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:51:30,647][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:51:31,367][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:51:32,088][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:51:32,810][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:51:33,542][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 22:51:34,617][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:51:34,620][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:51:34,621][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:51:36,039][__main__][INFO] - Iteration 529 took 55s (8.94% Gen, 88.51% Train). Generation: 4s, Training: 49s. Estimated remaining time: 7h 1m 39s. Estimated total time: 15h 28m 41s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 52s, 500 more iterations: 7h 44m 20s. [2026-03-25 22:51:36,041][__main__][INFO] - Starting iteration 529. [2026-03-25 22:51:36,046][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:51:36,046][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:51:40,979][__main__][INFO] - Number of regex retries in iteration 529: 0 [2026-03-25 22:51:40,980][__main__][INFO] - agents played in iteration 529 are Bob, Alice [2026-03-25 22:51:41,492][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:51:41,559][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:51:41,560][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:51:41,561][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:51:42,251][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:51:42,899][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:51:43,619][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:51:44,337][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:51:45,056][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:51:45,774][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:51:46,493][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:51:47,211][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:51:47,930][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:51:48,648][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:51:49,368][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:51:50,086][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:51:50,807][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:51:51,526][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:51:52,246][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:51:52,965][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:51:53,684][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:51:54,403][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:51:55,122][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:51:55,841][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:51:56,562][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:51:57,282][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:51:58,001][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:51:58,721][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:51:59,442][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:52:00,162][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:52:00,881][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:52:01,602][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:52:02,322][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:52:03,042][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:52:03,762][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:52:04,483][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:52:05,202][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:52:05,924][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:52:06,645][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:52:07,363][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:52:08,085][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:52:08,806][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:52:09,527][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:52:10,248][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:52:10,968][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:52:11,688][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:52:12,410][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:52:13,129][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:52:13,849][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:52:14,571][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:52:15,292][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:52:16,012][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:52:16,987][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:52:17,710][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:52:18,429][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:52:19,151][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:52:19,873][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:52:20,593][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:52:21,315][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:52:22,036][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:52:22,758][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:52:23,481][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:52:24,202][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:52:24,923][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:52:25,644][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:52:26,365][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:52:28,318][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:52:29,038][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:52:29,760][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:52:30,523][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:48 [2026-03-25 22:52:31,701][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:52:31,705][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:52:31,707][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:52:33,287][__main__][INFO] - Iteration 530 took 57s (8.62% Gen, 88.62% Train). Generation: 4s, Training: 50s. Estimated remaining time: 7h 26m 5s. Estimated total time: 15h 54m 4s. Time estimates for 10 more iterations: 9m 32s, 100 more iterations: 1h 35m 24s, 500 more iterations: 7h 57m 2s. [2026-03-25 22:52:33,291][__main__][INFO] - Starting iteration 530. [2026-03-25 22:52:33,297][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:52:33,298][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:52:38,200][__main__][INFO] - Number of regex retries in iteration 530: 0 [2026-03-25 22:52:38,201][__main__][INFO] - agents played in iteration 530 are Bob, Alice [2026-03-25 22:52:38,705][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:52:38,772][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:52:38,773][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:52:38,773][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:52:39,481][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:52:40,129][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:52:40,850][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:52:41,567][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:52:42,285][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:52:43,003][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:52:43,721][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:52:44,439][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:52:45,158][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:52:45,879][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:52:46,595][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:52:47,313][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:52:48,031][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:52:48,750][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:52:49,470][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:52:50,188][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:52:50,909][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:52:51,625][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:52:52,346][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:52:53,063][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:52:53,783][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:52:54,502][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:52:55,221][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:52:55,939][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:52:56,658][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:52:57,378][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:52:58,098][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:52:58,818][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:52:59,537][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:53:00,257][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:53:00,981][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:53:01,702][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:53:02,420][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:53:03,141][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:53:03,862][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:53:04,582][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:53:05,302][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:53:06,023][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:53:06,742][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:53:07,462][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:53:08,184][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:53:08,904][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:53:09,624][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:53:10,345][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:53:11,066][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:53:11,788][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:53:12,509][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:53:13,229][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:53:14,206][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:53:14,929][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:53:15,650][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:53:16,377][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:53:17,099][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:53:17,822][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:53:18,543][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:53:19,266][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:53:19,988][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:53:20,711][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:53:21,431][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:53:22,154][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:53:22,877][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:53:23,599][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:53:24,323][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:53:25,044][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:53:25,766][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:53:26,529][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 22:53:27,680][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:53:27,683][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:53:27,685][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:53:29,097][__main__][INFO] - Iteration 531 took 55s (8.79% Gen, 88.68% Train). Generation: 4s, Training: 49s. Estimated remaining time: 7h 1m 6s. Estimated total time: 15h 30m 1s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 0s, 500 more iterations: 7h 45m 0s. [2026-03-25 22:53:29,100][__main__][INFO] - Starting iteration 531. [2026-03-25 22:53:29,104][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:53:29,104][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:53:34,476][__main__][INFO] - Number of regex retries in iteration 531: 0 [2026-03-25 22:53:34,477][__main__][INFO] - agents played in iteration 531 are Bob, Alice [2026-03-25 22:53:34,977][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:53:35,043][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:53:35,043][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:53:35,044][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:53:35,730][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:53:36,378][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:53:37,098][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:53:37,816][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:53:38,533][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:53:39,254][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:53:39,973][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:53:40,691][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:53:41,410][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:53:42,128][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:53:42,847][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:53:43,568][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:53:44,286][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:53:45,004][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:53:45,724][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:53:46,442][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:53:47,163][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:53:47,883][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:53:48,601][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:53:49,322][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:53:50,040][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:53:50,759][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:53:51,479][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:53:52,197][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:53:52,919][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:53:53,637][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:53:54,357][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:53:55,078][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:53:55,798][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:53:56,517][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:53:57,237][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:53:57,957][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:53:58,678][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:53:59,399][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:54:00,118][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:54:00,840][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:54:01,559][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:54:02,280][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:54:03,001][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:54:03,721][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:54:04,443][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:54:05,162][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:54:05,884][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:54:06,606][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:54:07,327][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:54:08,047][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:54:08,768][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:54:09,491][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:54:10,446][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:54:11,168][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:54:11,889][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:54:12,610][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:54:13,331][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:54:14,051][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:54:14,772][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:54:15,494][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:54:16,214][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:54:16,935][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:54:17,656][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:54:18,378][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:54:19,098][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:54:19,819][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:54:20,541][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:54:21,262][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:54:21,983][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:54:22,711][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 22:54:23,787][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:54:23,790][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:54:23,791][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:54:25,141][__main__][INFO] - Iteration 532 took 56s (9.59% Gen, 88.00% Train). Generation: 5s, Training: 49s. Estimated remaining time: 7h 4m 8s. Estimated total time: 15h 33m 59s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 23s, 500 more iterations: 7h 46m 59s. [2026-03-25 22:54:25,143][__main__][INFO] - Starting iteration 532. [2026-03-25 22:54:25,147][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:54:25,147][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:54:30,095][__main__][INFO] - Number of regex retries in iteration 532: 0 [2026-03-25 22:54:30,096][__main__][INFO] - agents played in iteration 532 are Bob, Alice [2026-03-25 22:54:30,627][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:54:30,693][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:54:30,694][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:54:30,695][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:54:31,382][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:54:32,031][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:54:32,752][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:54:33,470][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:54:34,190][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:54:34,910][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:54:35,629][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:54:36,347][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:54:37,066][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:54:37,785][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:54:38,504][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:54:39,225][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:54:39,945][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:54:40,664][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:54:41,385][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:54:42,103][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:54:42,823][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:54:43,544][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:54:44,263][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:54:44,982][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:54:45,704][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:54:46,422][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:54:47,142][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:54:47,863][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:54:48,582][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:54:49,303][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:54:50,023][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:54:50,744][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:54:51,462][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:54:52,185][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:54:52,905][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:54:53,625][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:54:54,346][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:54:55,066][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:54:55,787][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:54:56,509][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:54:57,231][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:54:57,951][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:54:58,671][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:54:59,393][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:55:00,115][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:55:00,836][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:55:01,556][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:55:02,278][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:55:02,999][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:55:03,721][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:55:04,442][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:55:05,164][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:55:06,138][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:55:06,861][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:55:07,581][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:55:08,302][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:55:09,026][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:55:09,747][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:55:10,467][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:55:11,190][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:55:11,911][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:55:12,632][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:55:13,354][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:55:14,078][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:55:14,798][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:55:15,520][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:55:16,243][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:55:16,965][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:55:17,687][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:55:18,461][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 22:55:19,455][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:55:19,457][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:55:19,459][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:55:21,008][__main__][INFO] - Iteration 533 took 55s (8.86% Gen, 88.36% Train). Generation: 4s, Training: 49s. Estimated remaining time: 7h 0m 16s. Estimated total time: 15h 31m 2s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 6s, 500 more iterations: 7h 45m 31s. [2026-03-25 22:55:21,011][__main__][INFO] - Starting iteration 533. [2026-03-25 22:55:21,015][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:55:21,016][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:55:26,116][__main__][INFO] - Number of regex retries in iteration 533: 0 [2026-03-25 22:55:26,118][__main__][INFO] - agents played in iteration 533 are Bob, Alice [2026-03-25 22:55:26,714][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:55:26,778][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:55:26,779][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:55:26,780][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:55:27,495][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:55:28,143][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:55:28,864][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:55:29,581][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:55:30,301][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:55:32,721][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:55:35,170][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:55:35,888][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:55:36,606][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:55:37,323][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:55:38,041][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:55:38,759][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:55:39,478][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:55:40,195][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:55:40,913][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:55:41,631][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:55:42,350][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:55:43,068][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:55:43,787][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:55:44,505][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:55:45,224][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:55:45,943][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:55:46,661][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:55:47,379][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:55:48,098][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:55:48,818][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:55:49,535][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:55:50,255][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:55:50,974][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:55:51,693][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:55:52,412][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:55:53,132][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:55:53,852][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:55:54,571][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:55:55,289][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:55:56,010][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:55:56,730][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:55:57,450][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:55:58,169][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:55:58,889][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:55:59,610][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:56:00,330][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:56:01,049][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:56:01,768][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:56:02,489][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:56:03,209][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:56:03,928][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:56:04,650][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:56:05,628][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:56:06,349][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:56:07,070][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:56:07,790][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:56:08,511][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:56:09,234][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:56:09,954][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:56:10,675][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:56:11,396][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:56:12,118][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:56:12,839][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:56:13,562][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:56:14,282][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:56:15,002][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:56:15,723][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:56:16,443][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:56:17,163][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:56:17,892][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:50 [2026-03-25 22:56:19,108][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:56:19,112][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:56:19,114][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:56:20,486][__main__][INFO] - Iteration 534 took 59s (8.58% Gen, 89.11% Train). Generation: 5s, Training: 52s. Estimated remaining time: 7h 59m 26s. Estimated total time: 16h 31m 12s. Time estimates for 10 more iterations: 9m 54s, 100 more iterations: 1h 39m 7s, 500 more iterations: 8h 15m 36s. [2026-03-25 22:56:20,489][__main__][INFO] - Starting iteration 534. [2026-03-25 22:56:20,493][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:56:20,494][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:56:25,403][__main__][INFO] - Number of regex retries in iteration 534: 0 [2026-03-25 22:56:25,404][__main__][INFO] - agents played in iteration 534 are Bob, Alice [2026-03-25 22:56:25,918][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:56:25,985][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:56:25,986][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:56:25,986][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:56:26,676][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:56:27,326][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:56:28,045][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:56:28,763][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:56:29,482][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:56:30,200][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:56:30,919][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:56:31,635][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:56:32,355][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:56:33,073][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:56:33,790][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:56:34,509][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:56:35,227][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:56:35,946][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:56:36,664][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:56:37,383][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:56:38,102][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:56:38,823][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:56:39,542][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:56:40,261][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:56:40,981][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:56:41,699][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:56:42,419][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:56:43,138][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:56:43,857][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:56:44,576][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:56:45,298][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:56:46,016][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:56:46,736][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:56:47,456][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:56:48,175][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:56:48,895][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:56:49,617][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:56:50,335][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:56:51,054][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:56:51,775][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:56:52,493][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:56:53,214][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:56:53,936][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:56:54,654][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:56:55,374][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:56:56,095][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:56:56,816][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:56:57,535][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:56:58,256][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:56:58,978][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:56:59,696][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:57:00,418][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:57:01,371][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:57:02,092][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:57:02,813][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:57:03,532][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:57:04,253][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:57:04,974][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:57:05,694][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:57:06,415][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:57:07,136][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:57:07,857][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:57:08,577][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:57:09,298][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:57:10,019][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:57:10,739][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:57:11,460][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:57:12,184][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:57:12,905][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:57:13,631][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 22:57:14,878][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:57:14,882][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:57:14,884][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:57:16,271][__main__][INFO] - Iteration 535 took 55s (8.80% Gen, 88.71% Train). Generation: 4s, Training: 49s. Estimated remaining time: 6h 56m 57s. Estimated total time: 15h 29m 39s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 57s, 500 more iterations: 7h 44m 49s. [2026-03-25 22:57:16,274][__main__][INFO] - Starting iteration 535. [2026-03-25 22:57:16,281][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:57:16,283][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:57:21,185][__main__][INFO] - Number of regex retries in iteration 535: 0 [2026-03-25 22:57:21,186][__main__][INFO] - agents played in iteration 535 are Bob, Alice [2026-03-25 22:57:21,689][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:57:21,755][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:57:21,756][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:57:21,756][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:57:22,444][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:57:23,093][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:57:23,814][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:57:24,531][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:57:25,249][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:57:25,966][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:57:26,685][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:57:27,402][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:57:28,122][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:57:28,839][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:57:29,558][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:57:30,277][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:57:30,994][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:57:31,713][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:57:32,431][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:57:33,152][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:57:33,869][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:57:34,587][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:57:35,308][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:57:36,026][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:57:36,745][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:57:37,465][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:57:38,183][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:57:38,903][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:57:39,622][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:57:40,342][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:57:41,061][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:57:41,780][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:57:42,498][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:57:43,219][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:57:43,938][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:57:44,657][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:57:45,377][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:57:46,096][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:57:46,815][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:57:47,537][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:57:48,255][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:57:48,975][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:57:49,695][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:57:50,415][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:57:51,133][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:57:51,856][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:57:52,575][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:57:53,294][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:57:54,014][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:57:54,734][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:57:55,454][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:57:56,175][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:57:57,141][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:57:57,862][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:57:58,581][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:57:59,302][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:58:00,023][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:58:00,742][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:58:01,466][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:58:02,188][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:58:02,908][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:58:03,629][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:58:04,348][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:58:05,069][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:58:05,791][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:58:06,510][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:58:07,231][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:58:07,952][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:58:08,674][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:58:09,443][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 22:58:10,711][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:58:10,716][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:58:10,718][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:58:12,037][__main__][INFO] - Iteration 536 took 55s (8.79% Gen, 88.83% Train). Generation: 4s, Training: 49s. Estimated remaining time: 6h 55m 41s. Estimated total time: 15h 29m 19s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 55s, 500 more iterations: 7h 44m 39s. [2026-03-25 22:58:12,040][__main__][INFO] - Starting iteration 536. [2026-03-25 22:58:12,044][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:58:12,045][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:58:17,584][__main__][INFO] - Number of regex retries in iteration 536: 0 [2026-03-25 22:58:17,585][__main__][INFO] - agents played in iteration 536 are Bob, Alice [2026-03-25 22:58:18,093][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:58:18,158][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:58:18,159][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:58:18,160][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:58:18,865][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:58:19,513][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:58:20,231][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:58:20,948][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:58:21,664][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:58:22,384][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:58:23,101][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:58:23,820][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:58:24,538][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:58:25,257][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:58:25,975][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:58:26,693][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:58:27,411][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:58:28,128][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:58:28,849][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:58:29,566][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:58:30,286][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:58:31,004][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:58:31,724][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:58:32,441][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:58:33,161][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:58:33,879][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:58:34,598][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:58:35,317][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:58:36,036][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:58:36,758][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:58:37,475][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:58:38,194][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:58:38,916][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:58:39,635][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:58:40,355][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:58:41,075][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:58:41,794][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:58:42,514][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:58:43,233][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:58:43,952][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:58:44,672][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:58:45,391][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:58:46,111][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:58:46,831][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:58:47,552][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:58:48,271][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:58:48,991][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:58:49,711][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:58:50,430][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:58:51,149][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:58:51,870][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:58:52,589][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:58:53,567][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:58:54,290][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:58:55,008][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:58:55,728][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:58:56,448][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:58:57,167][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:58:57,893][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:58:58,614][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:58:59,337][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:59:00,057][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:59:00,777][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:59:01,498][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:59:02,218][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:59:02,939][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:59:03,659][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 22:59:04,380][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 22:59:05,101][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 22:59:05,840][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 22:59:06,941][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 22:59:06,944][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 22:59:06,946][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 22:59:08,283][__main__][INFO] - Iteration 537 took 56s (9.85% Gen, 87.77% Train). Generation: 5s, Training: 49s. Estimated remaining time: 7h 2m 47s. Estimated total time: 15h 37m 21s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 44s, 500 more iterations: 7h 48m 40s. [2026-03-25 22:59:08,285][__main__][INFO] - Starting iteration 537. [2026-03-25 22:59:08,289][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 22:59:08,290][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 22:59:13,265][__main__][INFO] - Number of regex retries in iteration 537: 0 [2026-03-25 22:59:13,266][__main__][INFO] - agents played in iteration 537 are Bob, Alice [2026-03-25 22:59:13,764][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:59:13,830][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 22:59:13,831][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 22:59:13,832][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 22:59:14,527][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 22:59:15,176][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 22:59:15,896][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 22:59:16,613][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 22:59:17,331][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 22:59:18,048][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 22:59:18,766][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 22:59:19,483][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 22:59:20,200][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 22:59:20,920][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 22:59:21,636][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 22:59:22,355][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 22:59:23,073][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 22:59:23,792][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 22:59:24,510][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 22:59:25,228][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 22:59:25,948][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 22:59:26,666][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 22:59:27,385][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 22:59:28,103][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 22:59:28,822][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 22:59:29,540][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 22:59:30,258][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 22:59:30,978][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 22:59:31,695][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 22:59:32,416][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 22:59:33,134][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 22:59:33,853][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 22:59:34,573][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 22:59:35,292][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 22:59:36,012][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 22:59:36,730][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 22:59:37,449][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 22:59:38,169][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 22:59:38,889][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 22:59:39,610][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 22:59:40,329][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 22:59:41,050][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 22:59:41,772][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 22:59:42,492][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 22:59:43,212][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 22:59:43,932][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 22:59:44,653][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 22:59:45,372][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 22:59:46,094][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 22:59:46,813][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 22:59:47,534][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 22:59:48,254][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 22:59:49,203][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 22:59:49,924][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 22:59:50,642][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 22:59:51,363][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
22:59:52,083][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 22:59:52,803][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 22:59:53,523][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 22:59:54,244][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 22:59:54,964][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 22:59:55,685][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 22:59:56,406][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 22:59:57,125][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 22:59:57,845][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 22:59:58,567][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 22:59:59,289][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:00:00,008][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:00:00,729][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:00:01,460][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 23:00:02,608][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:00:02,612][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:00:02,614][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:00:03,982][__main__][INFO] - Iteration 538 took 55s (8.93% Gen, 88.61% Train). Generation: 4s, Training: 49s. Estimated remaining time: 6h 52m 45s. Estimated total time: 15h 28m 14s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 49s, 500 more iterations: 7h 44m 7s. [2026-03-25 23:00:03,986][__main__][INFO] - Starting iteration 538. [2026-03-25 23:00:03,990][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 23:00:03,990][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:00:08,941][__main__][INFO] - Number of regex retries in iteration 538: 0 [2026-03-25 23:00:08,942][__main__][INFO] - agents played in iteration 538 are Bob, Alice [2026-03-25 23:00:09,445][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:00:09,512][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:00:09,513][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:00:09,514][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:00:10,214][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:00:10,862][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:00:11,581][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:00:12,299][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:00:13,017][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:00:13,734][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:00:14,451][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:00:15,169][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:00:15,888][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:00:16,604][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:00:17,325][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:00:18,040][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:00:18,758][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:00:19,478][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:00:20,196][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:00:20,914][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:00:21,633][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:00:22,351][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:00:23,070][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:00:23,787][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:00:24,506][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:00:25,224][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:00:25,942][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:00:26,662][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:00:27,380][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:00:28,100][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:00:28,818][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:00:29,536][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:00:30,257][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:00:30,975][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:00:31,695][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:00:32,414][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:00:33,133][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:00:33,855][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:00:34,573][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:00:35,293][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:00:36,012][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:00:36,731][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:00:37,452][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:00:38,170][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:00:38,891][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:00:39,612][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:00:40,330][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:00:41,051][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:00:41,770][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:00:42,490][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:00:43,211][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:00:43,930][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:00:44,903][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:00:45,624][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:00:46,345][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:00:47,064][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:00:47,785][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:00:48,507][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:00:49,227][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:00:49,948][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:00:50,667][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:00:51,388][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:00:52,108][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:00:52,828][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:00:53,548][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:00:54,269][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:00:54,988][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:00:55,710][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:00:56,432][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:00:57,228][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 23:00:58,309][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:00:58,312][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:00:58,313][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:00:59,619][__main__][INFO] - Iteration 539 took 55s (8.90% Gen, 88.75% Train). Generation: 4s, Training: 49s. Estimated remaining time: 6h 50m 45s. Estimated total time: 15h 27m 11s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 43s, 500 more iterations: 7h 43m 35s. [2026-03-25 23:00:59,621][__main__][INFO] - Starting iteration 539. [2026-03-25 23:00:59,625][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 23:00:59,626][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:01:04,590][__main__][INFO] - Number of regex retries in iteration 539: 0 [2026-03-25 23:01:04,591][__main__][INFO] - agents played in iteration 539 are Bob, Alice [2026-03-25 23:01:05,118][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:01:05,184][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:01:05,185][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:01:05,186][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:01:05,883][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:01:06,532][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:01:07,252][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:01:07,969][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:01:08,689][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:01:09,407][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:01:10,124][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:01:10,844][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:01:11,562][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:01:12,280][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:01:12,998][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:01:13,716][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:01:14,436][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:01:15,155][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:01:15,875][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:01:16,595][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:01:17,314][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:01:18,035][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:01:18,757][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:01:19,478][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:01:20,198][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:01:20,920][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:01:21,642][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:01:22,364][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:01:23,083][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:01:23,803][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:01:24,525][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:01:25,246][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:01:25,965][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:01:26,687][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:01:27,407][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:01:28,126][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:01:28,844][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:01:29,565][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:01:30,284][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:01:31,003][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:01:31,724][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:01:32,442][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:01:33,162][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:01:33,883][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:01:34,601][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:01:35,320][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:01:36,041][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:01:36,759][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:01:37,479][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:01:38,200][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:01:38,920][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:01:39,641][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:01:40,621][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:01:41,341][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:01:42,060][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:01:42,781][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:01:43,500][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:01:44,220][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:01:44,941][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:01:45,660][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:01:46,382][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:01:47,102][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:01:47,821][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:01:48,541][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:01:49,263][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:01:49,982][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:01:50,703][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:01:51,424][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:01:52,142][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:01:52,871][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 23:01:54,332][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:01:54,336][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:01:58,974][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:02:00,343][__main__][INFO] - Iteration 540 took 1m 0s (8.18% Gen, 89.56% Train). Generation: 4s, Training: 54s. Estimated remaining time: 8h 14m 34s. Estimated total time: 16h 52m 0s. Time estimates for 10 more iterations: 10m 7s, 100 more iterations: 1h 41m 12s, 500 more iterations: 8h 26m 0s. [2026-03-25 23:02:00,346][__main__][INFO] - Starting iteration 540. [2026-03-25 23:02:00,350][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 23:02:00,351][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:02:05,222][__main__][INFO] - Number of regex retries in iteration 540: 0 [2026-03-25 23:02:05,223][__main__][INFO] - agents played in iteration 540 are Bob, Alice [2026-03-25 23:02:05,825][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:02:05,891][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:02:05,892][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:02:05,892][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:02:06,575][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:02:07,222][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:02:07,941][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:02:08,657][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:02:09,375][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:02:10,091][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:02:10,807][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:02:11,522][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:02:12,237][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:02:12,954][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:02:13,671][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:02:14,386][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:02:15,104][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:02:15,819][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:02:16,537][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:02:17,253][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:02:17,970][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:02:18,686][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:02:19,405][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:02:20,121][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:02:20,839][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:02:21,557][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:02:22,274][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:02:22,991][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:02:23,709][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:02:24,430][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:02:25,147][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:02:25,864][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:02:26,582][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:02:27,298][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:02:28,018][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:02:28,735][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:02:29,453][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:02:30,171][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:02:30,888][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:02:31,609][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:02:32,326][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:02:33,043][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:02:33,765][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:02:34,482][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:02:35,200][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:02:35,918][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:02:36,637][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:02:37,355][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:02:38,073][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:02:38,792][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:02:39,510][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:02:40,229][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:02:41,186][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:02:41,905][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:02:42,623][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:02:43,342][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:02:44,060][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:02:44,780][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:02:45,499][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:02:46,220][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:02:46,938][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:02:47,657][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:02:48,377][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:02:49,096][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:02:49,816][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:02:50,536][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:02:51,255][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:02:51,975][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:02:52,695][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:02:53,428][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 23:02:54,570][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:02:54,573][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:02:54,575][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:02:56,183][__main__][INFO] - Iteration 541 took 55s (8.73% Gen, 88.39% Train). Generation: 4s, Training: 49s. Estimated remaining time: 6h 52m 13s. Estimated total time: 15h 30m 35s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 3s, 500 more iterations: 7h 45m 17s. [2026-03-25 23:02:56,187][__main__][INFO] - Starting iteration 541. [2026-03-25 23:02:56,193][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 23:02:56,194][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:03:03,082][__main__][INFO] - Number of regex retries in iteration 541: 0 [2026-03-25 23:03:03,084][__main__][INFO] - agents played in iteration 541 are Bob, Alice [2026-03-25 23:03:03,594][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:03:03,659][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:03:03,660][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:03:03,661][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:03:04,351][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:03:04,998][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:03:05,717][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:03:06,432][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:03:07,150][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:03:07,866][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:03:08,583][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:03:09,302][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:03:10,020][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:03:10,735][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:03:11,454][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:03:12,171][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:03:12,886][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:03:13,604][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:03:14,320][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:03:15,037][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:03:15,754][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:03:16,473][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:03:17,190][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:03:17,908][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:03:18,625][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:03:19,343][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:03:20,062][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:03:20,780][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:03:21,497][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:03:22,215][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:03:22,932][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:03:23,649][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:03:24,367][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:03:25,086][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:03:25,804][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:03:26,521][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:03:27,240][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:03:27,958][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:03:28,678][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:03:29,395][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:03:30,112][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:03:30,832][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:03:31,550][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:03:32,268][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:03:32,986][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:03:33,704][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:03:34,425][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:03:35,144][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:03:35,861][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:03:36,581][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:03:37,299][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:03:38,019][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:03:38,968][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:03:39,690][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:03:40,407][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:03:41,126][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:03:41,846][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:03:42,564][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:03:43,284][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:03:44,004][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:03:44,722][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:03:45,442][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:03:46,162][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:03:46,881][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:03:47,601][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:03:48,322][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:03:49,040][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:03:49,759][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:03:50,480][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:03:51,241][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 23:03:52,495][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:03:52,499][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:03:52,500][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:03:53,871][__main__][INFO] - Iteration 542 took 57s (11.94% Gen, 85.67% Train). Generation: 6s, Training: 49s. Estimated remaining time: 7h 22m 2s. Estimated total time: 16h 1m 21s. Time estimates for 10 more iterations: 9m 36s, 100 more iterations: 1h 36m 8s, 500 more iterations: 8h 0m 40s. [2026-03-25 23:03:53,874][__main__][INFO] - Starting iteration 542. [2026-03-25 23:03:53,878][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 23:03:53,878][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:03:58,792][__main__][INFO] - Number of regex retries in iteration 542: 0 [2026-03-25 23:03:58,794][__main__][INFO] - agents played in iteration 542 are Bob, Alice [2026-03-25 23:03:59,295][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:03:59,361][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:03:59,362][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:03:59,362][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:04:00,051][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:04:00,700][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:04:01,420][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:04:02,136][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:04:02,854][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:04:03,571][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:04:04,290][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:04:05,009][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:04:05,729][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:04:06,446][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:04:07,168][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:04:07,884][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:04:08,604][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:04:09,323][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:04:10,042][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:04:10,758][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:04:11,476][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:04:12,194][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:04:12,911][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:04:13,629][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:04:14,347][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:04:15,067][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:04:15,783][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:04:16,503][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:04:17,221][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:04:17,938][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:04:18,658][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:04:19,376][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:04:20,095][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:04:20,814][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:04:21,533][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:04:22,252][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:04:22,970][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:04:23,688][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:04:24,409][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:04:25,126][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:04:25,845][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:04:26,566][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:04:27,283][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:04:28,002][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:04:28,723][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:04:29,441][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:04:30,160][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:04:30,878][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:04:31,598][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:04:32,318][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:04:33,037][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:04:33,757][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:04:34,762][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:04:35,483][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:04:36,202][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:04:36,921][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:04:37,642][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:04:38,365][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:04:39,086][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:04:39,807][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:04:40,529][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:04:41,249][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:04:41,968][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:04:42,692][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:04:43,414][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:04:44,133][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:04:44,857][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:04:45,582][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:04:46,306][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:04:47,089][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 23:04:48,219][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:04:48,223][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:04:48,224][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:04:49,542][__main__][INFO] - Iteration 543 took 55s (8.83% Gen, 88.80% Train). Generation: 4s, Training: 49s. Estimated remaining time: 6h 47m 31s. Estimated total time: 15h 27m 46s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 46s, 500 more iterations: 7h 43m 53s. [2026-03-25 23:04:49,547][__main__][INFO] - Starting iteration 543. [2026-03-25 23:04:49,553][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 23:04:49,554][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:04:54,527][__main__][INFO] - Number of regex retries in iteration 543: 0 [2026-03-25 23:04:54,528][__main__][INFO] - agents played in iteration 543 are Bob, Alice [2026-03-25 23:04:55,033][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:04:55,097][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:04:55,098][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:04:55,099][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:04:55,785][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:04:56,432][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:04:57,150][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:04:57,868][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:04:58,584][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:04:59,302][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:05:00,019][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:05:00,736][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:05:01,454][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:05:02,171][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:05:02,888][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:05:03,607][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:05:04,324][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:05:05,042][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:05:05,760][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:05:06,479][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:05:07,198][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:05:07,917][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:05:08,636][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:05:09,355][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:05:10,075][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:05:10,793][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:05:11,512][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:05:12,231][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:05:12,950][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:05:13,668][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:05:14,386][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:05:15,104][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:05:15,824][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:05:16,542][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:05:17,261][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:05:17,980][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:05:18,699][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:05:19,418][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:05:20,136][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:05:20,856][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:05:21,576][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:05:22,294][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:05:23,015][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:05:23,734][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:05:24,452][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:05:25,174][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:05:25,894][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:05:26,612][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:05:27,334][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:05:28,053][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:05:28,773][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:05:29,492][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:05:30,461][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:05:31,182][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:05:31,901][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:05:32,621][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:05:33,342][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:05:34,062][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:05:34,783][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:05:35,504][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:05:36,224][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:05:36,944][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:05:37,663][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:05:38,384][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:05:39,105][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:05:39,826][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:05:40,545][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:05:41,267][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:05:41,986][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:05:42,710][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 23:05:44,091][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:05:44,096][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:05:44,098][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:05:45,534][__main__][INFO] - Iteration 544 took 55s (8.89% Gen, 88.54% Train). Generation: 4s, Training: 49s. Estimated remaining time: 6h 51m 52s. Estimated total time: 15h 33m 4s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 18s, 500 more iterations: 7h 46m 32s. [2026-03-25 23:05:45,538][__main__][INFO] - Starting iteration 544. [2026-03-25 23:05:45,544][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 23:05:45,544][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:05:51,123][__main__][INFO] - Number of regex retries in iteration 544: 0 [2026-03-25 23:05:51,125][__main__][INFO] - agents played in iteration 544 are Bob, Alice [2026-03-25 23:05:51,623][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:05:51,688][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:05:51,689][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:05:51,690][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:05:52,380][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:05:53,027][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:05:53,746][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:05:54,463][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:05:55,181][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:05:55,898][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:05:56,616][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:05:57,332][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:05:58,050][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:05:58,768][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:05:59,486][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:06:00,204][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:06:00,921][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:06:01,640][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:06:02,357][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:06:03,076][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:06:03,794][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:06:04,512][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:06:05,231][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:06:05,949][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:06:06,668][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:06:07,385][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:06:08,104][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:06:08,825][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:06:09,548][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:06:10,265][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:06:10,986][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:06:11,707][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:06:12,427][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:06:13,146][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:06:13,867][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:06:14,588][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:06:15,309][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:06:16,030][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:06:16,749][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:06:17,469][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:06:18,188][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:06:18,908][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:06:19,627][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:06:20,345][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:06:21,067][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:06:21,786][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:06:22,506][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:06:23,225][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:06:23,944][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:06:24,664][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:06:25,383][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:06:26,105][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:06:27,058][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:06:27,780][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:06:28,499][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:06:29,218][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:06:29,938][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:06:30,658][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:06:31,378][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:06:32,098][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:06:32,818][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:06:33,537][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:06:34,258][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:06:34,978][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:06:35,698][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:06:36,419][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:06:37,139][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:06:37,859][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:06:38,579][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:06:39,311][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 23:06:40,530][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:06:40,535][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:06:40,537][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:06:41,829][__main__][INFO] - Iteration 545 took 56s (9.91% Gen, 87.79% Train). Generation: 5s, Training: 49s. Estimated remaining time: 6h 55m 59s. Estimated total time: 15h 38m 7s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 48s, 500 more iterations: 7h 49m 3s. [2026-03-25 23:06:41,832][__main__][INFO] - Starting iteration 545. [2026-03-25 23:06:41,836][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 23:06:41,839][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:06:46,794][__main__][INFO] - Number of regex retries in iteration 545: 0 [2026-03-25 23:06:46,795][__main__][INFO] - agents played in iteration 545 are Bob, Alice [2026-03-25 23:06:47,293][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:06:47,358][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:06:47,359][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:06:47,360][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:06:48,051][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:06:48,699][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:06:49,418][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:06:50,135][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:06:50,853][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:06:51,570][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:06:52,288][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:06:53,005][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:06:53,723][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:06:54,441][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:06:55,159][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:06:55,877][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:06:56,595][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:06:57,314][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:06:58,034][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:06:58,751][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:06:59,471][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:07:00,188][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:07:00,906][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:07:01,625][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:07:02,342][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:07:03,062][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:07:03,780][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:07:04,498][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:07:05,217][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:07:05,935][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:07:06,653][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:07:07,373][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:07:08,091][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:07:08,811][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:07:09,531][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:07:10,248][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:07:10,969][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:07:11,687][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:07:12,405][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:07:13,125][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:07:13,843][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:07:14,563][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:07:15,282][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:07:16,001][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:07:16,721][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:07:17,440][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:07:18,160][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:07:18,878][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:07:19,596][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:07:20,318][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:07:21,035][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:07:21,754][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:07:22,783][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:07:23,504][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:07:24,221][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:07:24,942][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:07:25,660][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:07:26,381][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:07:27,100][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:07:27,820][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:07:28,541][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:07:29,260][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:07:29,980][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:07:30,700][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:07:31,419][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:07:32,141][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:07:32,860][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:07:33,581][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:07:34,302][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:07:35,052][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 23:07:36,279][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:07:36,283][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:07:36,284][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:07:37,590][__main__][INFO] - Iteration 546 took 55s (8.89% Gen, 88.76% Train). Generation: 4s, Training: 49s. Estimated remaining time: 6h 46m 13s. Estimated total time: 15h 29m 16s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 55s, 500 more iterations: 7h 44m 38s. [2026-03-25 23:07:37,593][__main__][INFO] - Starting iteration 546. [2026-03-25 23:07:37,597][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 23:07:37,598][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:07:42,574][__main__][INFO] - Number of regex retries in iteration 546: 0 [2026-03-25 23:07:42,575][__main__][INFO] - agents played in iteration 546 are Bob, Alice [2026-03-25 23:07:43,081][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:07:43,146][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:07:43,147][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:07:43,148][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:07:43,833][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:07:44,482][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:07:45,203][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:07:45,922][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:07:46,642][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:07:47,360][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:07:48,080][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:07:48,799][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:07:49,519][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:07:50,241][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:07:50,959][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:07:51,676][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:07:52,398][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:07:53,117][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:07:53,835][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:07:54,555][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:07:55,274][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:07:55,992][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:07:56,710][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:07:57,429][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:07:58,147][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:07:58,865][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:07:59,583][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:08:00,302][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:08:01,020][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:08:01,739][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:08:02,456][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:08:03,175][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:08:03,894][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:08:04,611][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:08:05,331][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:08:06,049][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:08:06,768][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:08:07,488][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:08:08,206][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:08:08,925][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:08:09,646][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:08:10,364][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:08:11,083][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:08:11,803][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:08:12,521][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:08:13,240][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:08:13,960][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:08:14,678][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:08:15,397][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:08:16,118][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:08:16,836][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:08:17,556][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:08:18,511][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:08:19,232][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:08:19,949][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:08:20,670][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:08:21,389][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:08:22,108][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:08:22,828][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:08:23,547][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:08:24,268][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:08:24,987][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:08:25,705][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:08:26,426][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:08:27,146][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:08:27,865][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:08:28,586][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:08:29,305][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:08:30,026][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:08:30,759][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 23:08:31,976][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:08:31,980][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:08:31,982][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:08:33,415][__main__][INFO] - Iteration 547 took 55s (8.92% Gen, 88.51% Train). Generation: 4s, Training: 49s. Estimated remaining time: 6h 46m 20s. Estimated total time: 15h 30m 19s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 1s, 500 more iterations: 7h 45m 9s. [2026-03-25 23:08:33,418][__main__][INFO] - Starting iteration 547. [2026-03-25 23:08:33,422][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 23:08:33,423][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:08:38,333][__main__][INFO] - Number of regex retries in iteration 547: 0 [2026-03-25 23:08:38,334][__main__][INFO] - agents played in iteration 547 are Bob, Alice [2026-03-25 23:08:38,873][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:08:38,940][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:08:38,943][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:08:38,943][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:08:39,646][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:08:40,384][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:08:41,103][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:08:41,818][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:08:42,538][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:08:43,255][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:08:43,973][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:08:44,689][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:08:45,407][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:08:46,124][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:08:46,842][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:08:47,560][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:08:48,276][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:08:48,994][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:08:50,907][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:08:51,714][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:08:52,431][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:08:53,148][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:08:53,866][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:08:54,583][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:08:55,301][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:08:56,018][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:08:56,736][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:08:57,453][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:08:58,175][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:08:58,890][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:08:59,608][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:09:00,326][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:09:01,044][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:09:01,762][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:09:02,479][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:09:03,199][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:09:03,916][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:09:04,633][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:09:05,352][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:09:06,070][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:09:06,789][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:09:07,507][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:09:08,226][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:09:08,945][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:09:09,663][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:09:10,382][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:09:11,100][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:09:11,817][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:09:12,537][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:09:13,255][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:09:13,975][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:09:14,694][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:09:15,648][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:09:16,370][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:09:17,087][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:09:17,806][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:09:18,527][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:09:19,246][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:09:19,964][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:09:20,684][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:09:21,404][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:09:22,123][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:09:22,843][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:09:23,561][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:09:24,282][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:09:25,002][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:09:25,721][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:09:26,441][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:09:27,159][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:09:27,911][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:48 [2026-03-25 23:09:29,090][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:09:29,093][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:09:29,094][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:09:30,469][__main__][INFO] - Iteration 548 took 57s (8.61% Gen, 88.97% Train). Generation: 4s, Training: 50s. Estimated remaining time: 7h 5m 53s. Estimated total time: 15h 50m 49s. Time estimates for 10 more iterations: 9m 30s, 100 more iterations: 1h 35m 4s, 500 more iterations: 7h 55m 24s. [2026-03-25 23:09:30,476][__main__][INFO] - Starting iteration 548. [2026-03-25 23:09:30,482][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 23:09:30,484][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:09:35,500][__main__][INFO] - Number of regex retries in iteration 548: 0 [2026-03-25 23:09:35,501][__main__][INFO] - agents played in iteration 548 are Bob, Alice [2026-03-25 23:09:36,111][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:09:36,176][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:09:36,177][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:09:36,178][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:09:36,868][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:09:37,514][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:09:38,234][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:09:38,953][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:09:39,670][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:09:40,387][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:09:41,104][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:09:41,823][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:09:42,542][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:09:43,260][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:09:43,977][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:09:44,694][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:09:45,409][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:09:46,130][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:09:46,847][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:09:47,567][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:09:48,283][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:09:49,003][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:09:49,722][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:09:50,439][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:09:51,158][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:09:51,874][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:09:52,593][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:09:53,311][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:09:54,028][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:09:54,748][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:09:55,466][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:09:56,184][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:09:56,901][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:09:57,620][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:09:58,339][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:09:59,057][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:09:59,776][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:10:00,496][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:10:01,215][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:10:01,933][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:10:02,653][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:10:03,371][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:10:04,090][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:10:04,809][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:10:05,527][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:10:06,248][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:10:06,967][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:10:07,685][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:10:08,406][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:10:09,124][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:10:09,843][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:10:10,563][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:10:11,601][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:10:12,322][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:10:13,040][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:10:13,759][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:10:14,478][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:10:15,198][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:10:15,916][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:10:16,637][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:10:17,358][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:10:18,075][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:10:18,794][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:10:19,515][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:10:20,233][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:10:20,954][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:10:21,675][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:10:22,394][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:10:23,114][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:10:23,848][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 23:10:25,319][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:10:25,326][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:10:25,330][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:10:27,252][__main__][INFO] - Iteration 549 took 56s (8.84% Gen, 87.77% Train). Generation: 5s, Training: 49s. Estimated remaining time: 7h 0m 19s. Estimated total time: 15h 46m 12s. Time estimates for 10 more iterations: 9m 27s, 100 more iterations: 1h 34m 37s, 500 more iterations: 7h 53m 6s. [2026-03-25 23:10:27,255][__main__][INFO] - Starting iteration 549. [2026-03-25 23:10:27,259][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 23:10:27,259][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:10:32,694][__main__][INFO] - Number of regex retries in iteration 549: 0 [2026-03-25 23:10:32,696][__main__][INFO] - agents played in iteration 549 are Bob, Alice [2026-03-25 23:10:33,210][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:10:33,274][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:10:33,275][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:10:33,276][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:10:33,954][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:10:34,602][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:10:35,322][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:10:36,036][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:10:36,754][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:10:37,470][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:10:38,187][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:10:38,904][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:10:39,621][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:10:40,338][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:10:41,055][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:10:41,771][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:10:42,488][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:10:43,204][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:10:43,922][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:10:44,639][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:10:45,357][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:10:46,074][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:10:46,792][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:10:47,510][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:10:48,228][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:10:48,945][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:10:49,662][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:10:50,379][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:10:51,097][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:10:51,814][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:10:52,533][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:10:53,249][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:10:53,968][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:10:54,685][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:10:55,403][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:10:56,121][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:10:56,840][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:10:57,559][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:10:58,277][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:10:58,997][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:10:59,715][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:11:00,433][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:11:01,152][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:11:01,870][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:11:02,590][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:11:03,308][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:11:04,026][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:11:04,746][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:11:05,464][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:11:06,183][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:11:06,902][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:11:07,620][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:11:08,573][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:11:09,291][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:11:10,011][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:11:10,729][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:11:11,447][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:11:12,167][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:11:12,885][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:11:13,606][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:11:14,325][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:11:15,047][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:11:15,767][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:11:16,488][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:11:17,206][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:11:17,926][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:11:18,646][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:11:19,365][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:11:20,084][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:11:20,818][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 23:11:22,135][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:11:22,139][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:11:22,141][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:11:23,551][__main__][INFO] - Iteration 550 took 56s (9.66% Gen, 87.83% Train). Generation: 5s, Training: 49s. Estimated remaining time: 6h 51m 25s. Estimated total time: 15h 38m 14s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 49s, 500 more iterations: 7h 49m 7s. [2026-03-25 23:11:23,555][__main__][INFO] - Starting iteration 550. [2026-03-25 23:11:23,562][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2026-03-25 23:11:23,563][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:11:28,449][__main__][INFO] - Number of regex retries in iteration 550: 0 [2026-03-25 23:11:28,450][__main__][INFO] - agents played in iteration 550 are Bob, Alice [2026-03-25 23:11:28,951][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:11:29,017][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:11:29,017][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:11:29,018][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:11:29,715][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:11:30,360][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:11:31,079][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:11:31,796][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:11:32,512][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:11:33,229][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:11:33,945][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:11:34,662][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:11:35,378][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:11:36,096][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:11:36,812][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:11:37,528][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:11:38,245][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:11:38,963][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:11:39,681][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:11:40,398][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:11:41,117][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:11:41,834][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:11:42,551][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:11:43,269][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:11:43,986][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:11:44,704][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:11:45,422][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:11:46,141][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:11:46,860][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:11:47,579][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:11:48,296][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:11:49,016][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:11:49,733][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:11:50,452][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:11:51,171][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:11:51,889][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:11:52,608][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:11:53,326][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:11:54,045][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:11:54,763][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:11:55,481][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:11:56,202][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:11:56,921][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:11:57,642][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:11:58,361][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:11:59,079][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:11:59,797][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:12:00,516][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:12:01,233][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:12:01,953][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:12:02,672][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:12:03,391][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:12:04,347][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:12:05,066][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:12:05,785][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:12:06,503][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:12:07,223][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:12:07,943][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:12:08,661][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:12:09,380][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:12:10,102][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:12:10,820][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:12:11,539][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:12:12,259][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:12:12,978][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:12:13,697][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:12:14,418][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:12:15,135][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:12:15,856][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:12:16,579][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 23:12:17,864][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:12:17,869][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:12:17,870][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:12:21,270][__main__][INFO] - Iteration 551 took 57s (8.47% Gen, 85.63% Train). Generation: 4s, Training: 49s. Estimated remaining time: 7h 14m 4s. Estimated total time: 16h 1m 51s. Time estimates for 10 more iterations: 9m 37s, 100 more iterations: 1h 36m 11s, 500 more iterations: 8h 0m 55s. [2026-03-25 23:12:21,275][__main__][INFO] - Starting iteration 551. [2026-03-25 23:12:21,281][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:12:21,282][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:12:26,261][__main__][INFO] - Number of regex retries in iteration 551: 0 [2026-03-25 23:12:26,263][__main__][INFO] - agents played in iteration 551 are Bob, Alice [2026-03-25 23:12:26,769][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:12:26,835][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:12:26,836][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:12:26,837][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:12:27,523][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:12:28,170][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:12:28,888][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:12:29,606][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:12:30,322][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:12:31,039][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:12:31,754][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:12:32,472][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:12:33,188][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:12:33,905][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:12:34,621][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:12:35,337][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:12:36,054][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:12:36,773][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:12:37,492][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:12:38,209][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:12:38,928][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:12:39,645][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:12:40,362][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:12:41,080][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:12:41,796][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:12:42,513][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:12:43,231][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:12:43,950][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:12:44,667][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:12:45,384][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:12:46,103][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:12:46,821][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:12:47,540][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:12:48,258][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:12:48,977][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:12:49,694][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:12:50,412][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:12:51,131][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:12:51,849][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:12:52,567][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:12:53,285][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:12:54,004][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:12:54,723][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:12:55,440][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:12:56,159][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:12:56,877][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:12:57,597][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:12:58,316][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:12:59,034][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:12:59,753][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:13:00,470][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:13:01,190][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:13:02,153][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:13:02,874][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:13:03,594][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:13:04,312][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:13:05,032][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:13:05,749][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:13:06,469][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:13:07,187][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:13:07,906][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:13:08,626][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:13:09,346][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:13:10,066][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:13:10,785][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:13:11,503][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:13:12,225][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:13:12,942][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:13:13,662][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:13:14,460][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 23:13:15,723][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:13:15,727][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:13:15,729][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:13:17,054][__main__][INFO] - Iteration 552 took 55s (8.93% Gen, 88.69% Train). Generation: 4s, Training: 49s. Estimated remaining time: 6h 40m 53s. Estimated total time: 15h 29m 36s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 57s, 500 more iterations: 7h 44m 48s. [2026-03-25 23:13:17,057][__main__][INFO] - Starting iteration 552. [2026-03-25 23:13:17,061][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:13:17,062][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:13:21,965][__main__][INFO] - Number of regex retries in iteration 552: 0 [2026-03-25 23:13:21,966][__main__][INFO] - agents played in iteration 552 are Bob, Alice [2026-03-25 23:13:22,460][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:13:22,524][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:13:22,525][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:13:22,526][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:13:23,212][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:13:23,863][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:13:24,581][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:13:25,299][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:13:26,014][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:13:26,733][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:13:27,449][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:13:28,168][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:13:28,883][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:13:29,600][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:13:30,317][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:13:31,034][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:13:31,750][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:13:32,468][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:13:33,186][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:13:33,902][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:13:34,620][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:13:35,336][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:13:36,055][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:13:36,773][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:13:37,490][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:13:38,209][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:13:38,927][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:13:39,645][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:13:40,364][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:13:41,082][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:13:41,800][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:13:42,518][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:13:43,236][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:13:43,955][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:13:44,672][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:13:45,390][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:13:46,109][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:13:46,827][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:13:47,546][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:13:48,264][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:13:48,983][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:13:49,703][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:13:50,421][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:13:51,140][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:13:51,858][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:13:52,576][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:13:53,296][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:13:54,015][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:13:54,734][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:13:55,454][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:13:56,171][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:13:56,890][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:13:57,874][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:13:58,593][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:13:59,313][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:14:00,031][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:14:00,750][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:14:01,469][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:14:02,188][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:14:02,908][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:14:03,627][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:14:04,346][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:14:05,067][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:14:05,785][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:14:06,506][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:14:07,225][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:14:07,945][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:14:08,666][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:14:09,387][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:14:10,114][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 23:14:11,386][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:14:11,391][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:14:11,393][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:14:12,756][__main__][INFO] - Iteration 553 took 55s (8.81% Gen, 88.74% Train). Generation: 4s, Training: 49s. Estimated remaining time: 6h 38m 38s. Estimated total time: 15h 28m 16s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 49s, 500 more iterations: 7h 44m 8s. [2026-03-25 23:14:12,759][__main__][INFO] - Starting iteration 553. [2026-03-25 23:14:12,763][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:14:12,764][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:14:17,771][__main__][INFO] - Number of regex retries in iteration 553: 0 [2026-03-25 23:14:17,772][__main__][INFO] - agents played in iteration 553 are Bob, Alice [2026-03-25 23:14:18,283][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:14:18,347][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:14:18,348][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:14:18,349][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:14:19,036][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:14:19,683][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:14:20,401][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:14:21,117][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:14:21,833][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:14:22,550][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:14:23,268][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:14:23,983][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:14:24,701][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:14:25,416][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:14:26,135][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:14:26,851][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:14:27,568][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:14:28,286][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:14:29,003][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:14:29,720][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:14:30,438][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:14:31,156][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:14:31,874][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:14:32,591][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:14:33,309][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:14:34,027][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:14:34,746][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:14:35,462][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:14:36,180][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:14:36,898][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:14:37,615][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:14:38,335][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:14:39,053][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:14:39,771][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:14:40,490][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:14:41,208][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:14:41,927][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:14:42,644][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:14:43,362][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:14:44,082][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:14:44,799][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:14:45,518][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:14:46,235][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:14:46,955][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:14:47,674][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:14:48,392][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:14:49,111][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:14:49,830][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:14:50,548][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:14:51,268][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:14:51,986][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:14:52,703][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:14:53,656][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:14:54,377][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:14:55,095][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:14:55,813][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:14:56,534][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:14:57,253][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:14:57,971][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:14:58,691][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:14:59,411][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:15:00,130][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:15:00,849][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:15:01,569][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:15:02,288][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:15:03,007][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:15:03,726][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:15:04,446][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:15:05,168][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:15:05,890][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 23:15:07,136][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:15:07,140][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:15:07,144][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:15:08,522][__main__][INFO] - Iteration 554 took 55s (8.98% Gen, 88.54% Train). Generation: 5s, Training: 49s. Estimated remaining time: 6h 38m 47s. Estimated total time: 15h 29m 21s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 56s, 500 more iterations: 7h 44m 40s. [2026-03-25 23:15:08,525][__main__][INFO] - Starting iteration 554. [2026-03-25 23:15:08,529][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:15:08,530][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:15:13,921][__main__][INFO] - Number of regex retries in iteration 554: 0 [2026-03-25 23:15:13,922][__main__][INFO] - agents played in iteration 554 are Bob, Alice [2026-03-25 23:15:14,440][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:15:14,505][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:15:14,506][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:15:14,507][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:15:15,201][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:15:15,849][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:15:16,568][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:15:17,284][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:15:18,000][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:15:18,716][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:15:19,435][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:15:20,150][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:15:20,869][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:15:21,586][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:15:22,303][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:15:23,020][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:15:23,737][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:15:24,454][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:15:25,172][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:15:25,889][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:15:26,609][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:15:27,326][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:15:28,044][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:15:28,762][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:15:29,479][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:15:30,195][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:15:30,913][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:15:31,632][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:15:32,349][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:15:33,069][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:15:33,786][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:15:34,502][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:15:35,220][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:15:35,939][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:15:36,657][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:15:37,374][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:15:38,093][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:15:38,812][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:15:39,530][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:15:40,250][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:15:40,967][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:15:41,685][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:15:42,405][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:15:43,122][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:15:43,842][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:15:44,561][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:15:45,280][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:15:45,999][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:15:46,717][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:15:47,436][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:15:48,156][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:15:48,874][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:15:49,825][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:15:50,545][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:15:51,264][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:15:51,983][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:15:52,701][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:15:53,422][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:15:54,140][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:15:54,860][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:15:55,579][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:15:56,297][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:15:57,017][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:15:57,737][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:15:58,458][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:15:59,178][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:15:59,898][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:16:00,616][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:16:01,339][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:16:02,128][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 23:16:03,472][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:16:03,477][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:16:03,479][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:16:04,768][__main__][INFO] - Iteration 555 took 56s (9.59% Gen, 88.12% Train). Generation: 5s, Training: 49s. Estimated remaining time: 6h 45m 50s. Estimated total time: 15h 37m 20s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 44s, 500 more iterations: 7h 48m 40s. [2026-03-25 23:16:04,770][__main__][INFO] - Starting iteration 555. [2026-03-25 23:16:04,775][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:16:04,776][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:16:09,681][__main__][INFO] - Number of regex retries in iteration 555: 0 [2026-03-25 23:16:09,682][__main__][INFO] - agents played in iteration 555 are Bob, Alice [2026-03-25 23:16:10,261][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:16:10,325][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:16:10,326][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:16:10,326][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:16:11,035][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:16:11,683][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:16:12,403][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:16:13,118][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:16:13,835][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:16:14,552][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:16:15,269][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:16:15,986][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:16:16,703][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:16:17,419][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:16:18,138][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:16:18,855][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:16:19,574][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:16:20,290][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:16:21,010][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:16:21,727][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:16:22,446][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:16:23,162][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:16:23,880][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:16:24,597][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:16:25,316][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:16:26,032][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:16:26,753][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:16:27,470][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:16:28,189][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:16:28,906][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:16:29,625][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:16:30,343][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:16:31,062][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:16:31,781][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:16:32,498][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:16:33,217][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:16:33,935][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:16:34,654][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:16:35,371][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:16:36,091][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:16:36,809][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:16:37,528][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:16:38,248][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:16:38,966][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:16:39,686][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:16:40,405][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:16:41,125][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:16:41,843][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:16:42,563][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:16:43,282][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:16:44,001][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:16:44,720][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:16:45,700][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:16:46,420][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:16:47,139][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:16:47,858][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:16:48,577][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:16:49,296][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:16:50,014][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:16:50,734][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:16:51,453][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:16:52,172][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:16:52,893][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:16:53,612][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:16:54,331][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:16:55,050][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:16:55,771][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:16:56,490][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:16:57,209][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:16:57,938][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 23:16:59,329][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:16:59,334][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:16:59,336][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:17:00,717][__main__][INFO] - Iteration 556 took 55s (8.77% Gen, 88.76% Train). Generation: 4s, Training: 49s. Estimated remaining time: 6h 39m 58s. Estimated total time: 15h 32m 25s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 14s, 500 more iterations: 7h 46m 12s. [2026-03-25 23:17:00,720][__main__][INFO] - Starting iteration 556. [2026-03-25 23:17:00,724][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:17:00,725][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:17:05,702][__main__][INFO] - Number of regex retries in iteration 556: 0 [2026-03-25 23:17:05,753][__main__][INFO] - agents played in iteration 556 are Bob, Alice [2026-03-25 23:17:06,329][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:17:06,396][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:17:06,397][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:17:06,398][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:17:07,112][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:17:07,760][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:17:08,482][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:17:09,200][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:17:09,919][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:17:10,639][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:17:11,357][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:17:12,075][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:17:12,792][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:17:13,511][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:17:14,230][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:17:14,949][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:17:15,667][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:17:16,386][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:17:17,103][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:17:17,822][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:17:18,540][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:17:19,258][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:17:19,976][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:17:20,693][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:17:21,412][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:17:22,131][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:17:22,849][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:17:23,566][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:17:24,285][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:17:25,002][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:17:25,720][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:17:26,437][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:17:27,155][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:17:27,876][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:17:28,594][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:17:29,313][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:17:30,032][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:17:30,750][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:17:31,469][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:17:32,186][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:17:32,906][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:17:33,623][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:17:34,342][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:17:35,059][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:17:35,778][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:17:36,497][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:17:37,216][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:17:37,934][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:17:38,654][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:17:39,375][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:17:40,093][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:17:40,814][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:17:41,760][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:17:42,480][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:17:43,197][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:17:43,916][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:17:44,635][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:17:45,355][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:17:46,073][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:17:46,792][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:17:47,513][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:17:48,231][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:17:48,951][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:17:49,672][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:17:50,390][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:17:51,110][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:17:51,830][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:17:52,549][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:17:53,269][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:17:54,007][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 23:17:55,396][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:17:55,401][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:17:55,403][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:17:57,088][__main__][INFO] - Iteration 557 took 56s (8.92% Gen, 88.09% Train). Generation: 5s, Training: 49s. Estimated remaining time: 6h 46m 2s. Estimated total time: 15h 39m 25s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 56s, 500 more iterations: 7h 49m 42s. [2026-03-25 23:17:57,091][__main__][INFO] - Starting iteration 557. [2026-03-25 23:17:57,095][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:17:57,095][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:18:02,167][__main__][INFO] - Number of regex retries in iteration 557: 0 [2026-03-25 23:18:02,168][__main__][INFO] - agents played in iteration 557 are Bob, Alice [2026-03-25 23:18:02,669][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:18:02,734][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:18:02,735][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:18:02,736][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:18:03,433][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:18:04,079][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:18:04,800][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:18:05,516][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:18:06,234][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:18:06,950][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:18:07,665][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:18:08,382][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:18:09,099][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:18:09,817][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:18:10,535][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:18:11,253][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:18:11,968][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:18:12,686][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:18:13,406][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:18:14,125][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:18:14,843][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:18:15,562][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:18:16,280][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:18:16,997][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:18:17,715][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:18:18,434][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:18:19,150][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:18:19,868][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:18:20,587][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:18:21,304][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:18:22,024][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:18:22,743][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:18:23,462][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:18:24,180][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:18:24,897][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:18:25,616][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:18:26,334][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:18:27,053][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:18:27,772][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:18:28,492][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:18:29,212][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:18:29,931][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:18:30,652][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:18:31,370][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:18:32,089][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:18:32,807][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:18:33,525][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:18:34,245][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:18:34,963][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:18:35,682][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:18:36,401][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:18:37,119][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:18:38,148][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:18:38,868][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:18:39,586][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:18:40,306][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:18:41,028][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:18:41,749][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:18:42,469][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:18:43,192][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:18:43,915][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:18:44,638][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:18:45,359][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:18:46,082][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:18:46,803][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:18:47,525][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:18:48,245][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:18:48,964][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:18:49,685][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:18:50,435][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 23:18:51,519][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:18:51,522][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:18:51,523][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:18:53,729][__main__][INFO] - Iteration 558 took 56s (8.96% Gen, 87.14% Train). Generation: 5s, Training: 49s. Estimated remaining time: 6h 49m 37s. Estimated total time: 15h 43m 56s. Time estimates for 10 more iterations: 9m 26s, 100 more iterations: 1h 34m 23s, 500 more iterations: 7h 51m 58s. [2026-03-25 23:18:53,732][__main__][INFO] - Starting iteration 558. [2026-03-25 23:18:53,737][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:18:53,738][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:18:58,710][__main__][INFO] - Number of regex retries in iteration 558: 0 [2026-03-25 23:18:58,711][__main__][INFO] - agents played in iteration 558 are Bob, Alice [2026-03-25 23:18:59,209][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:18:59,273][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:18:59,274][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:18:59,274][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:18:59,959][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:19:00,606][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:19:01,324][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:19:02,044][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:19:02,760][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:19:03,476][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:19:04,193][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:19:04,910][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:19:05,627][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:19:06,345][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:19:07,060][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:19:07,779][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:19:08,496][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:19:09,214][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:19:09,931][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:19:10,648][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:19:11,367][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:19:12,084][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:19:12,802][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:19:13,519][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:19:14,237][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:19:14,954][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:19:15,672][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:19:16,390][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:19:17,107][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:19:17,827][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:19:18,543][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:19:19,261][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:19:19,979][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:19:20,697][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:19:21,416][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:19:22,134][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:19:22,852][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:19:23,569][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:19:24,288][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:19:25,007][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:19:25,723][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:19:26,444][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:19:27,161][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:19:27,884][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:19:28,608][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:19:29,328][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:19:30,049][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:19:30,771][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:19:31,493][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:19:32,214][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:19:32,935][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:19:33,657][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:19:34,627][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:19:35,350][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:19:36,071][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:19:36,791][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:19:37,512][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:19:38,236][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:19:38,957][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:19:39,679][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:19:40,401][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:19:41,125][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:19:41,848][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:19:42,570][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:19:43,291][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:19:44,012][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:19:44,734][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:19:45,453][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:19:46,171][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:19:46,928][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 23:19:47,960][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:19:47,963][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:19:47,964][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:19:49,326][__main__][INFO] - Iteration 559 took 55s (8.95% Gen, 88.60% Train). Generation: 4s, Training: 49s. Estimated remaining time: 6h 31m 16s. Estimated total time: 15h 26m 31s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 39s, 500 more iterations: 7h 43m 15s. [2026-03-25 23:19:49,329][__main__][INFO] - Starting iteration 559. [2026-03-25 23:19:49,343][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:19:49,344][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:19:54,452][__main__][INFO] - Number of regex retries in iteration 559: 0 [2026-03-25 23:19:54,454][__main__][INFO] - agents played in iteration 559 are Bob, Alice [2026-03-25 23:19:54,983][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:19:55,048][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:19:55,049][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:19:55,049][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:19:55,737][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:19:56,383][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:19:57,102][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:19:57,818][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:19:58,533][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:19:59,252][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:19:59,967][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:20:00,684][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:20:01,402][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:20:02,117][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:20:02,837][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:20:03,555][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:20:04,273][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:20:04,990][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:20:05,707][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:20:06,425][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:20:07,142][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:20:07,861][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:20:08,577][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:20:09,296][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:20:10,014][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:20:10,732][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:20:11,451][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:20:12,168][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:20:12,888][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:20:13,605][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:20:14,325][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:20:15,044][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:20:15,762][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:20:16,482][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:20:17,201][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:20:17,919][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:20:18,638][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:20:19,356][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:20:20,075][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:20:20,795][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:20:21,513][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:20:22,231][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:20:22,949][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:20:23,668][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:20:24,388][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:20:25,106][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:20:25,825][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:20:26,547][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:20:27,265][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:20:27,984][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:20:28,704][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:20:29,423][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:20:30,372][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:20:31,092][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:20:31,812][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:20:32,530][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:20:33,251][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:20:33,971][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:20:34,690][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:20:35,409][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:20:36,131][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:20:36,850][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:20:37,570][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:20:38,290][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:20:39,010][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:20:39,730][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:20:40,451][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:20:41,171][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:20:41,890][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:20:42,630][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 23:20:44,053][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:20:44,058][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:20:44,060][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:20:45,447][__main__][INFO] - Iteration 560 took 56s (9.11% Gen, 88.42% Train). Generation: 5s, Training: 49s. Estimated remaining time: 6h 38m 55s. Estimated total time: 15h 35m 6s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 30s, 500 more iterations: 7h 47m 33s. [2026-03-25 23:20:45,451][__main__][INFO] - Starting iteration 560. [2026-03-25 23:20:45,460][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:20:45,461][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:20:50,565][__main__][INFO] - Number of regex retries in iteration 560: 0 [2026-03-25 23:20:50,567][__main__][INFO] - agents played in iteration 560 are Bob, Alice [2026-03-25 23:20:51,072][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:20:51,138][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:20:51,139][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:20:51,140][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:20:51,836][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:20:52,489][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:20:53,208][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:20:53,923][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:20:54,641][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:20:55,357][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:20:56,077][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:20:56,794][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:20:57,512][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:20:58,229][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:20:58,948][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:20:59,665][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:21:00,383][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:21:01,100][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:21:01,819][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:21:02,535][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:21:03,253][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:21:03,972][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:21:04,691][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:21:05,409][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:21:06,126][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:21:06,845][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:21:07,562][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:21:08,280][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:21:09,000][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:21:09,716][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:21:10,437][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:21:11,155][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:21:11,872][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:21:12,592][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:21:13,310][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:21:14,028][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:21:14,747][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:21:15,464][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:21:16,185][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:21:16,903][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:21:17,621][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:21:18,341][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:21:19,058][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:21:19,777][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:21:20,497][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:21:21,215][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:21:21,934][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:21:22,654][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:21:23,373][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:21:24,093][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:21:24,813][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:21:25,531][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:21:26,554][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:21:27,275][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:21:27,993][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:21:28,712][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:21:29,431][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:21:30,149][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:21:30,870][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:21:31,588][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:21:32,307][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:21:33,028][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:21:33,746][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:21:34,466][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:21:35,187][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:21:35,906][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:21:36,625][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:21:37,346][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:21:38,065][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:21:38,822][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 23:21:39,894][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:21:39,897][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:21:39,899][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:21:41,263][__main__][INFO] - Iteration 561 took 55s (9.15% Gen, 88.40% Train). Generation: 5s, Training: 49s. Estimated remaining time: 6h 32m 58s. Estimated total time: 15h 30m 5s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 0s, 500 more iterations: 7h 45m 2s. [2026-03-25 23:21:41,266][__main__][INFO] - Starting iteration 561. [2026-03-25 23:21:41,271][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:21:41,271][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:21:48,076][__main__][INFO] - Number of regex retries in iteration 561: 0 [2026-03-25 23:21:48,078][__main__][INFO] - agents played in iteration 561 are Bob, Alice [2026-03-25 23:21:48,578][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:21:48,642][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:21:48,643][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:21:48,644][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:21:49,328][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:21:49,976][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:21:50,695][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:21:51,411][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:21:52,126][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:21:52,844][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:21:53,559][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:21:54,277][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:21:54,992][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:21:55,710][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:21:56,426][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:21:57,143][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:21:57,860][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:21:58,578][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:21:59,294][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:22:00,011][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:22:00,729][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:22:01,446][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:22:02,164][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:22:02,882][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:22:03,598][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:22:04,316][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:22:05,032][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:22:05,751][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:22:06,469][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:22:07,188][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:22:07,904][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:22:08,624][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:22:09,342][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:22:10,061][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:22:10,780][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:22:11,497][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:22:12,216][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:22:12,935][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:22:13,652][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:22:14,372][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:22:15,090][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:22:15,808][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:22:16,528][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:22:17,245][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:22:17,965][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:22:18,683][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:22:19,403][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:22:20,122][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:22:20,844][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:22:21,562][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:22:22,282][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:22:23,002][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:22:23,954][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:22:24,673][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:22:25,391][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:22:26,112][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:22:26,830][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:22:27,549][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:22:28,269][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:22:28,986][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:22:29,707][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:22:30,426][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:22:31,144][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:22:31,865][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:22:32,585][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:22:33,303][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:22:34,024][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:22:34,744][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:22:35,463][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:22:36,189][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 23:22:37,305][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:22:37,308][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:22:37,310][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:22:39,422][__main__][INFO] - Iteration 562 took 58s (11.70% Gen, 84.66% Train). Generation: 6s, Training: 49s. Estimated remaining time: 7h 11m 9s. Estimated total time: 16h 9m 14s. Time estimates for 10 more iterations: 9m 41s, 100 more iterations: 1h 36m 55s, 500 more iterations: 8h 4m 37s. [2026-03-25 23:22:39,426][__main__][INFO] - Starting iteration 562. [2026-03-25 23:22:39,433][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:22:39,434][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:22:44,647][__main__][INFO] - Number of regex retries in iteration 562: 0 [2026-03-25 23:22:44,649][__main__][INFO] - agents played in iteration 562 are Bob, Alice [2026-03-25 23:22:45,415][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:22:45,480][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:22:45,481][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:22:45,482][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:22:46,173][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:22:46,821][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:22:47,540][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:22:48,258][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:22:48,972][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:22:49,690][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:22:50,407][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:22:51,123][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:22:51,842][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:22:52,558][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:22:53,276][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:22:53,993][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:22:54,710][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:22:55,428][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:22:56,145][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:22:56,861][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:22:57,581][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:22:58,298][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:22:59,017][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:22:59,733][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:23:00,451][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:23:01,169][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:23:01,887][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:23:02,604][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:23:03,321][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:23:04,039][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:23:04,757][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:23:05,477][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:23:06,195][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:23:06,913][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:23:07,715][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:23:08,433][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:23:09,153][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:23:09,871][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:23:10,591][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:23:11,309][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:23:12,028][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:23:12,747][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:23:13,467][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:23:14,185][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:23:14,905][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:23:15,622][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:23:16,341][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:23:17,061][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:23:17,778][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:23:18,497][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:23:19,215][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:23:19,935][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:23:20,894][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:23:21,612][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:23:22,333][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:23:23,052][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:23:23,770][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:23:24,490][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:23:25,209][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:23:25,928][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:23:26,648][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:23:27,367][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:23:28,087][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:23:28,806][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:23:29,526][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:23:30,245][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:23:30,965][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:23:31,685][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:23:32,404][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:23:33,146][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 23:23:34,481][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:23:34,486][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:23:34,488][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:23:35,955][__main__][INFO] - Iteration 563 took 56s (9.22% Gen, 88.17% Train). Generation: 5s, Training: 49s. Estimated remaining time: 6h 43m 2s. Estimated total time: 15h 42m 4s. Time estimates for 10 more iterations: 9m 25s, 100 more iterations: 1h 34m 12s, 500 more iterations: 7h 51m 2s. [2026-03-25 23:23:35,965][__main__][INFO] - Starting iteration 563. [2026-03-25 23:23:35,976][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:23:35,977][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:23:40,987][__main__][INFO] - Number of regex retries in iteration 563: 0 [2026-03-25 23:23:40,988][__main__][INFO] - agents played in iteration 563 are Bob, Alice [2026-03-25 23:23:41,620][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:23:41,684][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:23:41,685][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:23:41,686][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:23:42,380][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:23:43,027][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:23:43,748][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:23:44,463][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:23:45,180][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:23:45,896][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:23:46,614][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:23:47,330][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:23:48,047][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:23:48,765][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:23:49,484][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:23:50,199][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:23:50,918][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:23:51,634][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:23:52,351][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:23:53,069][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:23:53,786][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:23:54,505][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:23:55,222][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:23:55,939][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:23:56,657][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:23:57,374][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:23:58,093][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:23:58,810][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:23:59,533][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:24:00,254][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:24:00,972][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:24:01,690][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:24:02,410][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:24:03,127][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:24:03,846][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:24:04,563][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:24:05,283][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:24:06,001][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:24:06,719][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:24:07,437][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:24:08,155][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:24:08,874][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:24:09,592][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:24:10,311][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:24:11,029][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:24:11,749][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:24:12,467][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:24:13,186][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:24:13,905][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:24:14,623][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:24:15,344][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:24:16,063][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:24:17,096][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:24:17,816][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:24:18,534][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:24:19,254][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:24:19,972][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:24:20,691][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:24:21,412][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:24:22,129][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:24:22,852][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:24:23,571][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:24:24,290][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:24:25,011][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:24:25,729][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:24:26,448][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:24:27,169][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:24:27,888][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:24:28,607][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:24:29,368][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 23:24:30,545][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:24:30,549][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:24:30,551][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:24:32,138][__main__][INFO] - Iteration 564 took 56s (8.92% Gen, 88.25% Train). Generation: 5s, Training: 49s. Estimated remaining time: 6h 36m 6s. Estimated total time: 15h 36m 3s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 36s, 500 more iterations: 7h 48m 1s. [2026-03-25 23:24:32,142][__main__][INFO] - Starting iteration 564. [2026-03-25 23:24:32,148][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:24:32,149][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:24:39,388][__main__][INFO] - Number of regex retries in iteration 564: 0 [2026-03-25 23:24:39,389][__main__][INFO] - agents played in iteration 564 are Bob, Alice [2026-03-25 23:24:39,936][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:24:40,004][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:24:40,005][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:24:40,005][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:24:40,684][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:24:41,333][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:24:42,051][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:24:42,766][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:24:43,485][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:24:44,201][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:24:44,919][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:24:45,636][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:24:46,353][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:24:47,070][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:24:47,786][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:24:48,502][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:24:49,219][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:24:49,937][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:24:50,653][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:24:51,371][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:24:52,088][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:24:52,805][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:24:53,523][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:24:54,240][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:24:54,957][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:24:55,674][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:24:56,393][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:24:57,109][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:24:57,829][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:24:58,546][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:24:59,265][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:24:59,982][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:25:00,699][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:25:01,418][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:25:02,137][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:25:02,854][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:25:03,574][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:25:04,291][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:25:05,010][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:25:05,728][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:25:06,445][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:25:07,164][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:25:07,881][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:25:08,601][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:25:09,321][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:25:10,039][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:25:10,759][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:25:11,476][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:25:12,195][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:25:12,912][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:25:13,631][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:25:14,350][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:25:15,301][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:25:16,020][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:25:16,740][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:25:17,458][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:25:18,178][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:25:18,896][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:25:19,615][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:25:20,334][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:25:21,053][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:25:21,772][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:25:22,490][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:25:23,209][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:25:23,929][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:25:24,648][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:25:25,369][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:25:26,088][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:25:26,807][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:25:27,532][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 23:25:28,664][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:25:28,668][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:25:28,670][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:25:30,161][__main__][INFO] - Iteration 565 took 58s (12.48% Gen, 84.95% Train). Generation: 7s, Training: 49s. Estimated remaining time: 7h 5m 59s. Estimated total time: 16h 6m 54s. Time estimates for 10 more iterations: 9m 40s, 100 more iterations: 1h 36m 41s, 500 more iterations: 8h 3m 27s. [2026-03-25 23:25:30,164][__main__][INFO] - Starting iteration 565. [2026-03-25 23:25:30,168][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:25:30,169][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:25:35,130][__main__][INFO] - Number of regex retries in iteration 565: 0 [2026-03-25 23:25:35,131][__main__][INFO] - agents played in iteration 565 are Bob, Alice [2026-03-25 23:25:35,638][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:25:35,703][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:25:35,704][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:25:35,705][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:25:36,420][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:25:37,070][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:25:37,790][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:25:38,508][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:25:39,226][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:25:39,944][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:25:40,661][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:25:41,377][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:25:42,095][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:25:42,811][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:25:43,529][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:25:44,245][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:25:44,963][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:25:45,679][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:25:46,398][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:25:47,115][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:25:47,832][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:25:48,551][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:25:49,268][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:25:49,986][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:25:50,705][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:25:51,422][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:25:52,139][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:25:52,856][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:25:53,575][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:25:54,292][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:25:55,011][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:25:55,729][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:25:56,446][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:25:57,164][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:25:57,883][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:25:58,602][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:25:59,320][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:26:00,037][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:26:00,757][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:26:01,475][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:26:02,192][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:26:02,913][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:26:03,630][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:26:04,350][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:26:05,068][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:26:05,785][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:26:06,507][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:26:07,225][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:26:07,943][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:26:08,663][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:26:09,382][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:26:10,101][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:26:11,048][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:26:11,769][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:26:12,486][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:26:13,206][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:26:13,925][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:26:14,644][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:26:15,364][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:26:16,084][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:26:16,803][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:26:17,523][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:26:18,242][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:26:18,960][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:26:19,681][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:26:20,399][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:26:21,119][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:26:21,839][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:26:22,558][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:26:23,293][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 23:26:24,688][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:26:24,693][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:26:24,695][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:26:26,168][__main__][INFO] - Iteration 566 took 56s (8.86% Gen, 88.50% Train). Generation: 4s, Training: 49s. Estimated remaining time: 6h 31m 30s. Estimated total time: 15h 33m 22s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 20s, 500 more iterations: 7h 46m 41s. [2026-03-25 23:26:26,170][__main__][INFO] - Starting iteration 566. [2026-03-25 23:26:26,174][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:26:26,175][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:26:31,249][__main__][INFO] - Number of regex retries in iteration 566: 0 [2026-03-25 23:26:31,249][__main__][INFO] - agents played in iteration 566 are Bob, Alice [2026-03-25 23:26:31,752][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:26:31,817][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:26:31,817][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:26:31,818][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:26:32,498][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:26:33,147][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:26:33,866][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:26:34,582][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:26:35,298][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:26:36,015][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:26:36,731][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:26:37,449][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:26:38,166][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:26:38,886][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:26:39,602][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:26:40,320][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:26:41,037][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:26:41,753][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:26:42,470][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:26:43,188][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:26:43,906][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:26:44,622][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:26:45,340][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:26:46,057][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:26:46,775][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:26:47,492][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:26:48,210][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:26:48,928][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:26:49,645][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:26:50,365][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:26:51,083][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:26:51,800][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:26:52,520][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:26:53,238][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:26:53,957][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:26:54,675][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:26:55,392][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:26:56,112][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:26:56,830][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:26:57,548][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:26:58,265][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:26:58,984][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:26:59,703][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:27:00,419][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:27:01,139][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:27:01,857][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:27:02,575][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:27:03,294][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:27:04,011][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:27:04,731][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:27:05,450][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:27:06,169][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:27:07,192][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:27:07,912][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:27:08,630][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:27:09,349][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:27:10,067][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:27:10,786][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:27:11,506][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:27:12,223][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:27:12,943][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:27:13,662][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:27:14,381][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:27:15,099][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:27:15,818][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:27:16,537][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:27:17,258][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:27:17,976][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:27:18,695][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:27:19,453][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 23:27:20,549][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:27:20,553][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:27:20,554][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:27:21,956][__main__][INFO] - Iteration 567 took 55s (9.10% Gen, 88.39% Train). Generation: 5s, Training: 49s. Estimated remaining time: 6h 26m 56s. Estimated total time: 15h 29m 43s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 58s, 500 more iterations: 7h 44m 51s. [2026-03-25 23:27:21,960][__main__][INFO] - Starting iteration 567. [2026-03-25 23:27:21,964][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:27:21,965][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:27:28,184][__main__][INFO] - Number of regex retries in iteration 567: 0 [2026-03-25 23:27:28,185][__main__][INFO] - agents played in iteration 567 are Bob, Alice [2026-03-25 23:27:28,708][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:27:28,774][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:27:28,775][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:27:28,776][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:27:29,467][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:27:30,116][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:27:30,835][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:27:31,552][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:27:32,269][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:27:32,988][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:27:33,706][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:27:34,421][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:27:35,138][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:27:35,855][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:27:36,571][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:27:37,288][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:27:38,005][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:27:38,722][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:27:39,442][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:27:40,159][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:27:40,876][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:27:41,594][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:27:42,313][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:27:43,029][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:27:43,747][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:27:44,465][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:27:45,182][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:27:45,899][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:27:46,619][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:27:47,335][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:27:48,055][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:27:48,771][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:27:49,489][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:27:50,206][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:27:50,924][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:27:51,642][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:27:52,360][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:27:53,079][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:27:53,797][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:27:54,513][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:27:55,233][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:27:55,950][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:27:56,669][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:27:57,387][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:27:58,106][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:27:58,825][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:27:59,543][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:28:00,262][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:28:00,981][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:28:01,700][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:28:02,419][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:28:03,137][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:28:04,093][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:28:04,813][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:28:05,531][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:28:06,250][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:28:06,969][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:28:07,689][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:28:08,407][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:28:09,126][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:28:09,847][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:28:10,566][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:28:11,283][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:28:12,004][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:28:12,722][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:28:13,441][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:28:14,161][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:28:14,880][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:28:15,599][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:28:16,324][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 23:28:17,423][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:28:17,427][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:28:17,429][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:28:18,815][__main__][INFO] - Iteration 568 took 56s (10.94% Gen, 86.62% Train). Generation: 6s, Training: 49s. Estimated remaining time: 6h 43m 48s. Estimated total time: 15h 47m 32s. Time estimates for 10 more iterations: 9m 28s, 100 more iterations: 1h 34m 45s, 500 more iterations: 7h 53m 46s. [2026-03-25 23:28:18,817][__main__][INFO] - Starting iteration 568. [2026-03-25 23:28:18,821][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:28:18,822][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:28:26,575][__main__][INFO] - Number of regex retries in iteration 568: 0 [2026-03-25 23:28:26,576][__main__][INFO] - agents played in iteration 568 are Bob, Alice [2026-03-25 23:28:27,070][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:28:27,137][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:28:27,139][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:28:27,140][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:28:27,832][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:28:28,481][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:28:29,199][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:28:29,915][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:28:30,631][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:28:31,347][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:28:32,063][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:28:32,780][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:28:33,498][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:28:34,215][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:28:34,932][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:28:35,649][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:28:36,366][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:28:37,081][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:28:37,798][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:28:38,513][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:28:39,229][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:28:39,948][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:28:40,664][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:28:41,382][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:28:42,098][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:28:42,815][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:28:43,532][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:28:44,249][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:28:44,966][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:28:45,683][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:28:46,400][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:28:47,117][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:28:47,834][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:28:48,551][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:28:49,269][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:28:49,985][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:28:50,705][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:28:51,422][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:28:52,140][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:28:52,856][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:28:53,575][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:28:54,291][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:28:55,010][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:28:55,728][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:28:56,444][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:28:57,164][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:28:57,883][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:28:58,604][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:28:59,321][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:29:00,040][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:29:00,759][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:29:01,476][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:29:02,432][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:29:03,150][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:29:03,868][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:29:04,587][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:29:05,304][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:29:06,024][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:29:06,742][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:29:07,460][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:29:08,179][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:29:08,897][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:29:09,617][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:29:10,335][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:29:11,053][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:29:11,773][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:29:12,491][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:29:13,209][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:29:13,929][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:29:14,655][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 23:29:15,835][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:29:15,839][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:29:15,841][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:29:17,214][__main__][INFO] - Iteration 569 took 58s (13.28% Gen, 84.37% Train). Generation: 7s, Training: 49s. Estimated remaining time: 7h 8m 31s. Estimated total time: 16h 13m 14s. Time estimates for 10 more iterations: 9m 43s, 100 more iterations: 1h 37m 19s, 500 more iterations: 8h 6m 37s. [2026-03-25 23:29:17,216][__main__][INFO] - Starting iteration 569. [2026-03-25 23:29:17,222][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:29:17,223][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:29:22,297][__main__][INFO] - Number of regex retries in iteration 569: 0 [2026-03-25 23:29:22,298][__main__][INFO] - agents played in iteration 569 are Bob, Alice [2026-03-25 23:29:22,803][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:29:22,868][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:29:22,869][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:29:22,870][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:29:23,556][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:29:24,202][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:29:24,921][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:29:25,637][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:29:26,355][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:29:27,071][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:29:27,791][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:29:28,508][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:29:29,226][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:29:29,943][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:29:30,660][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:29:31,379][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:29:32,096][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:29:32,814][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:29:33,530][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:29:34,248][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:29:34,965][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:29:35,682][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:29:36,400][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:29:37,115][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:29:37,833][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:29:38,551][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:29:39,271][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:29:39,988][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:29:40,707][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:29:41,424][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:29:42,143][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:29:42,860][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:29:43,578][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:29:44,295][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:29:45,014][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:29:45,731][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:29:46,449][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:29:47,167][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:29:47,885][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:29:48,604][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:29:49,321][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:29:50,040][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:29:50,758][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:29:51,478][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:29:52,196][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:29:52,915][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:29:53,633][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:29:54,351][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:29:55,069][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:29:55,789][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:29:56,506][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:29:57,227][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:29:58,239][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:29:58,957][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:29:59,676][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:30:00,395][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:30:01,114][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:30:01,833][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:30:02,552][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:30:03,272][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:30:03,991][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:30:04,708][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:30:05,428][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:30:06,147][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:30:06,865][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:30:07,585][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:30:08,303][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:30:09,023][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:30:09,743][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:30:10,483][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 23:30:11,541][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:30:11,545][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:30:11,546][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:30:12,862][__main__][INFO] - Iteration 570 took 55s (9.12% Gen, 88.51% Train). Generation: 5s, Training: 49s. Estimated remaining time: 6h 21m 43s. Estimated total time: 15h 27m 22s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 44s, 500 more iterations: 7h 43m 41s. [2026-03-25 23:30:12,865][__main__][INFO] - Starting iteration 570. [2026-03-25 23:30:12,870][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:30:12,871][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:30:18,687][__main__][INFO] - Number of regex retries in iteration 570: 0 [2026-03-25 23:30:18,688][__main__][INFO] - agents played in iteration 570 are Bob, Alice [2026-03-25 23:30:19,493][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:30:19,557][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:30:19,559][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:30:19,559][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:30:20,244][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:30:20,891][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:30:21,611][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:30:22,329][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:30:23,045][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:30:23,762][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:30:24,478][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:30:25,197][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:30:25,914][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:30:26,631][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:30:27,349][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:30:28,069][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:30:28,786][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:30:29,502][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:30:30,218][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:30:30,935][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:30:31,652][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:30:32,369][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:30:33,086][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:30:33,803][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:30:34,519][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:30:35,237][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:30:35,953][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:30:36,672][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:30:37,388][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:30:38,105][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:30:38,824][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:30:39,540][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:30:40,259][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:30:40,976][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:30:41,695][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:30:42,411][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:30:43,130][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:30:43,846][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:30:44,564][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:30:45,283][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:30:46,001][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:30:46,720][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:30:47,437][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:30:48,155][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:30:48,873][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:30:49,590][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:30:50,309][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:30:51,026][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:30:51,746][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:30:52,463][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:30:53,183][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:30:53,902][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:30:54,891][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:30:55,610][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:30:56,331][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:30:57,049][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:30:57,769][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:30:58,486][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:30:59,206][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:30:59,925][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:31:00,642][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:31:01,361][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:31:02,079][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:31:02,798][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:31:03,516][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:31:04,235][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:31:04,954][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:31:05,673][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:31:06,391][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:31:07,114][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 23:31:09,641][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:31:09,644][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:31:09,646][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:31:22,405][__main__][INFO] - Iteration 571 took 1m 9s (8.37% Gen, 73.28% Train). Generation: 5s, Training: 50s. Estimated remaining time: 10h 12m 9s. Estimated total time: 19h 18m 58s. Time estimates for 10 more iterations: 11m 35s, 100 more iterations: 1h 55m 53s, 500 more iterations: 9h 39m 29s. [2026-03-25 23:31:22,408][__main__][INFO] - Starting iteration 571. [2026-03-25 23:31:22,413][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:31:22,413][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:31:31,961][__main__][INFO] - Number of regex retries in iteration 571: 0 [2026-03-25 23:31:31,962][__main__][INFO] - agents played in iteration 571 are Bob, Alice [2026-03-25 23:31:32,573][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:31:32,639][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:31:32,640][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:31:32,641][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:31:33,352][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:31:33,995][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:31:34,708][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:31:35,419][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:31:36,133][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:31:36,845][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:31:37,559][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:31:38,275][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:31:38,988][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:31:39,703][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:31:40,418][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:31:41,131][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:31:41,846][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:31:42,560][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:31:43,274][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:31:43,988][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:31:44,703][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:31:45,418][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:31:46,130][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:31:46,846][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:31:47,561][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:31:51,872][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:31:52,586][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:31:53,300][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:31:54,013][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:31:54,726][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:31:55,442][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:31:56,156][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:31:56,870][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:31:57,585][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:31:58,300][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:31:59,015][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:31:59,731][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:32:00,447][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:32:01,162][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:32:01,877][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:32:02,594][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:32:03,310][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:32:04,027][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:32:04,744][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:32:05,461][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:32:06,175][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:32:06,893][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:32:07,609][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:32:08,326][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:32:09,044][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:32:09,760][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:32:10,477][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:32:11,441][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:32:12,159][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:32:12,878][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:32:13,596][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:32:14,317][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:32:15,035][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:32:15,754][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:32:16,474][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:32:17,192][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:32:17,911][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:32:18,630][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:32:19,348][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:32:20,069][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:32:20,788][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:32:21,509][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:32:22,227][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:32:22,944][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:32:23,673][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:50 [2026-03-25 23:32:24,798][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:32:24,801][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:32:24,803][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:32:26,287][__main__][INFO] - Iteration 572 took 1m 3s (14.95% Gen, 82.72% Train). Generation: 9s, Training: 52s. Estimated remaining time: 8h 36m 44s. Estimated total time: 17h 44m 36s. Time estimates for 10 more iterations: 10m 38s, 100 more iterations: 1h 46m 27s, 500 more iterations: 8h 52m 18s. [2026-03-25 23:32:26,289][__main__][INFO] - Starting iteration 572. [2026-03-25 23:32:26,293][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:32:26,294][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:32:31,260][__main__][INFO] - Number of regex retries in iteration 572: 0 [2026-03-25 23:32:31,262][__main__][INFO] - agents played in iteration 572 are Bob, Alice [2026-03-25 23:32:31,774][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:32:31,840][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:32:31,841][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:32:31,842][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:32:32,528][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:32:33,175][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:32:33,893][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:32:34,608][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:32:35,322][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:32:36,038][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:32:36,752][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:32:37,466][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:32:38,183][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:32:38,898][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:32:39,615][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:32:40,330][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:32:41,046][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:32:41,761][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:32:42,479][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:32:43,194][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:32:43,912][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:32:44,628][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:32:45,344][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:32:46,060][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:32:46,776][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:32:47,492][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:32:48,208][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:32:48,925][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:32:49,642][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:32:50,359][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:32:51,076][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:32:51,796][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:32:52,513][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:32:53,229][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:32:53,946][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:32:54,664][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:32:55,380][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:32:56,098][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:32:56,815][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:32:57,531][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:32:58,250][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:32:58,966][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:32:59,686][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:33:00,403][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:33:01,121][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:33:01,839][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:33:02,557][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:33:03,276][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:33:03,994][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:33:04,717][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:33:05,436][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:33:06,155][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:33:07,207][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:33:07,929][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:33:08,649][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:33:09,369][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:33:10,089][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:33:10,808][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:33:11,525][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:33:12,246][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:33:12,963][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:33:13,681][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:33:14,399][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:33:15,118][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:33:15,834][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:33:16,553][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:33:17,269][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:33:17,988][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:33:18,706][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:33:19,453][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 23:33:20,583][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:33:20,586][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:33:20,587][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:33:21,897][__main__][INFO] - Iteration 573 took 55s (8.93% Gen, 88.71% Train). Generation: 4s, Training: 49s. Estimated remaining time: 6h 17m 57s. Estimated total time: 15h 26m 45s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 40s, 500 more iterations: 7h 43m 22s. [2026-03-25 23:33:21,900][__main__][INFO] - Starting iteration 573. [2026-03-25 23:33:21,904][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:33:21,905][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:33:26,956][__main__][INFO] - Number of regex retries in iteration 573: 0 [2026-03-25 23:33:26,957][__main__][INFO] - agents played in iteration 573 are Bob, Alice [2026-03-25 23:33:27,456][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:33:27,521][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:33:27,522][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:33:27,523][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:33:28,215][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:33:28,863][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:33:29,581][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:33:30,297][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:33:31,015][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:33:31,730][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:33:32,447][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:33:33,163][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:33:33,880][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:33:34,596][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:33:35,313][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:33:36,029][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:33:36,747][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:33:37,465][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:33:38,181][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:33:38,900][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:33:39,617][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:33:40,334][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:33:41,052][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:33:41,769][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:33:42,486][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:33:43,203][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:33:43,921][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:33:44,638][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:33:45,358][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:33:46,077][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:33:46,793][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:33:47,511][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:33:48,228][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:33:48,945][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:33:49,663][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:33:50,381][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:33:51,097][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:33:51,815][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:33:52,533][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:33:53,250][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:33:53,967][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:33:54,683][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:33:55,402][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:33:56,118][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:33:56,837][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:33:57,553][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:33:58,271][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:33:58,989][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:33:59,707][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:34:00,425][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:34:01,142][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:34:01,861][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:34:02,835][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:34:03,554][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:34:04,271][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:34:04,989][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:34:05,708][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:34:06,427][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:34:07,147][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:34:07,867][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:34:08,587][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:34:09,307][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:34:10,025][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:34:10,742][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:34:11,462][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:34:12,180][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:34:12,899][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:34:13,617][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:34:14,337][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:34:15,061][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 23:34:16,210][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:34:16,212][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:34:16,214][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:34:17,624][__main__][INFO] - Iteration 574 took 55s (9.07% Gen, 88.40% Train). Generation: 5s, Training: 49s. Estimated remaining time: 6h 18m 58s. Estimated total time: 15h 28m 41s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 52s, 500 more iterations: 7h 44m 20s. [2026-03-25 23:34:17,630][__main__][INFO] - Starting iteration 574. [2026-03-25 23:34:17,637][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:34:17,638][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:34:22,940][__main__][INFO] - Number of regex retries in iteration 574: 0 [2026-03-25 23:34:22,942][__main__][INFO] - agents played in iteration 574 are Bob, Alice [2026-03-25 23:34:23,652][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:34:23,718][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:34:23,719][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:34:23,719][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:34:24,411][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:34:25,058][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:34:25,776][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:34:26,493][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:34:27,207][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:34:27,926][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:34:28,645][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:34:29,362][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:34:30,079][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:34:30,798][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:34:31,515][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:34:32,233][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:34:32,951][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:34:33,668][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:34:34,386][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:34:35,103][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:34:35,821][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:34:36,538][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:34:37,257][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:34:37,976][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:34:38,693][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:34:39,412][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:34:40,128][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:34:40,845][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:34:41,563][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:34:42,282][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:34:43,002][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:34:43,721][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:34:44,439][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:34:45,159][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:34:45,878][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:34:46,597][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:34:47,317][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:34:48,035][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:34:48,754][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:34:49,474][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:34:50,194][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:34:50,913][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:34:51,634][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:34:52,352][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:34:53,073][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:34:53,792][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:34:54,512][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:34:55,231][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:34:55,951][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:34:56,670][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:34:57,390][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:34:58,111][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:34:59,078][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:34:59,798][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:35:00,517][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:35:01,238][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:35:01,957][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:35:02,678][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:35:03,400][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:35:04,119][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:35:04,837][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:35:05,559][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:35:06,279][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:35:06,998][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:35:07,719][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:35:08,441][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:35:09,161][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:35:09,882][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:35:10,603][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:35:11,352][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 23:35:12,410][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:35:12,413][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:35:12,415][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:35:13,777][__main__][INFO] - Iteration 575 took 56s (9.44% Gen, 88.12% Train). Generation: 5s, Training: 49s. Estimated remaining time: 6h 25m 3s. Estimated total time: 15h 35m 43s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 34s, 500 more iterations: 7h 47m 51s. [2026-03-25 23:35:13,780][__main__][INFO] - Starting iteration 575. [2026-03-25 23:35:13,785][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:35:13,786][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:35:18,740][__main__][INFO] - Number of regex retries in iteration 575: 0 [2026-03-25 23:35:18,741][__main__][INFO] - agents played in iteration 575 are Bob, Alice [2026-03-25 23:35:19,259][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:35:19,324][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:35:19,325][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:35:19,326][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:35:20,044][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:35:20,693][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:35:21,413][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:35:22,132][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:35:22,850][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:35:23,570][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:35:24,290][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:35:25,008][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:35:25,730][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:35:26,449][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:35:27,167][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:35:27,887][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:35:28,604][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:35:29,324][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:35:30,042][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:35:30,764][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:35:31,483][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:35:32,202][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:35:32,921][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:35:33,638][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:35:34,357][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:35:35,075][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:35:35,794][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:35:36,512][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:35:37,233][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:35:37,950][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:35:38,671][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:35:39,392][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:35:40,111][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:35:40,830][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:35:41,550][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:35:42,268][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:35:42,987][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:35:43,708][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:35:44,428][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:35:45,147][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:35:45,868][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:35:46,587][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:35:47,307][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:35:48,028][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:35:48,747][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:35:49,467][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:35:50,188][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:35:50,907][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:35:51,626][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:35:52,347][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:35:53,067][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:35:53,787][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:35:54,798][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:35:55,519][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:35:56,240][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:35:56,960][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:35:57,679][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:35:58,400][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:35:59,121][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:35:59,841][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:36:00,561][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:36:01,283][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:36:02,003][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:36:02,723][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:36:03,444][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:36:04,165][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:36:04,885][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:36:05,605][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:36:06,328][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:36:07,103][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 23:36:08,170][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:36:08,173][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:36:08,175][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:36:09,435][__main__][INFO] - Iteration 576 took 55s (8.90% Gen, 88.82% Train). Generation: 4s, Training: 49s. Estimated remaining time: 6h 15m 58s. Estimated total time: 15h 27m 33s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 45s, 500 more iterations: 7h 43m 46s. [2026-03-25 23:36:09,438][__main__][INFO] - Starting iteration 576. [2026-03-25 23:36:09,441][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:36:09,442][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:36:10,617][mllm.models.large_language_model_local][WARNING] - Response did not match regex: (|), retry 1/1 [2026-03-25 23:36:16,572][__main__][INFO] - Number of regex retries in iteration 576: 1 [2026-03-25 23:36:16,573][__main__][INFO] - agents played in iteration 576 are Bob, Alice [2026-03-25 23:36:17,092][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:36:17,157][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:36:17,157][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:36:17,158][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2026-03-25 23:36:17,884][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:36:18,531][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:36:19,253][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:36:19,970][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:36:20,689][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:36:21,406][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:36:22,125][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:36:22,844][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:36:23,563][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:36:24,283][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:36:25,002][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 23:36:25,720][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:36:26,441][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:36:27,159][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:36:27,880][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:36:28,599][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:36:29,320][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:36:30,041][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:36:30,761][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:36:31,481][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:36:32,202][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 256 [2026-03-25 23:36:32,922][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:36:33,642][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:36:34,362][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:36:35,079][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:36:35,800][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:36:36,518][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:36:37,237][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:36:37,956][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:36:38,674][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:36:39,394][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:36:40,113][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 23:36:40,832][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:36:41,550][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:36:42,272][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:36:42,990][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:36:43,710][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:36:44,430][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:36:45,149][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:36:45,868][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:36:46,588][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 
23:36:47,307][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:36:48,026][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:36:48,745][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:36:49,465][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:36:50,185][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:36:50,906][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:36:51,625][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:36:52,615][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:36:53,336][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:36:54,055][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:36:54,774][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 23:36:55,494][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:36:56,213][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:36:56,933][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:36:57,654][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:36:58,374][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:36:59,095][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:36:59,815][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:37:00,534][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:37:01,255][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:37:01,975][mllm.training.trainer_common][INFO] - 
Processing mini-batch 244 of 256 [2026-03-25 23:37:02,696][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:37:03,416][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:37:04,137][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 23:37:04,879][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 23:37:06,087][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:37:06,090][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:37:06,092][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:37:07,405][__main__][INFO] - Iteration 577 took 57s (12.30% Gen, 85.43% Train). Generation: 7s, Training: 49s. Estimated remaining time: 6h 53m 31s. Estimated total time: 16h 6m 4s. Time estimates for 10 more iterations: 9m 39s, 100 more iterations: 1h 36m 36s, 500 more iterations: 8h 3m 2s. [2026-03-25 23:37:07,408][__main__][INFO] - Starting iteration 577. [2026-03-25 23:37:07,412][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:37:07,412][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:37:12,440][__main__][INFO] - Number of regex retries in iteration 577: 0 [2026-03-25 23:37:12,441][__main__][INFO] - agents played in iteration 577 are Bob, Alice [2026-03-25 23:37:12,988][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:37:13,054][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:37:13,055][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:37:13,055][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:37:13,761][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:37:14,408][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:37:15,129][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:37:15,846][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:37:16,567][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:37:17,285][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:37:18,003][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:37:18,724][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:37:19,442][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:37:20,163][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:37:20,883][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:37:21,601][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:37:22,322][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:37:23,041][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:37:23,761][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:37:24,481][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:37:25,200][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:37:25,920][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:37:26,640][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:37:27,358][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:37:28,077][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:37:28,795][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:37:29,513][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:37:30,232][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:37:30,951][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:37:31,670][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:37:32,389][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:37:33,108][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:37:33,827][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:37:34,546][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:37:35,264][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:37:35,984][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:37:36,703][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:37:37,424][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:37:38,142][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:37:38,861][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:37:39,581][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:37:40,301][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:37:41,020][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:37:41,739][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:37:42,459][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:37:43,178][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:37:43,898][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:37:44,619][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:37:45,338][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:37:46,059][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:37:46,779][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:37:47,499][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:37:48,462][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:37:49,183][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:37:49,902][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:37:50,624][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:37:51,342][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:37:52,063][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:37:52,784][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:37:53,503][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:37:54,224][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:37:54,944][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:37:55,663][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:37:56,384][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:37:57,106][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:37:57,826][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:37:58,545][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:37:59,269][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:37:59,991][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:38:00,755][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 23:38:01,860][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:38:01,863][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:38:01,865][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:38:03,384][__main__][INFO] - Iteration 578 took 55s (8.98% Gen, 88.30% Train). Generation: 5s, Training: 49s. Estimated remaining time: 6h 19m 25s. Estimated total time: 15h 32m 54s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 17s, 500 more iterations: 7h 46m 27s. [2026-03-25 23:38:03,387][__main__][INFO] - Starting iteration 578. [2026-03-25 23:38:03,391][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:38:03,392][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:38:08,619][__main__][INFO] - Number of regex retries in iteration 578: 0 [2026-03-25 23:38:08,620][__main__][INFO] - agents played in iteration 578 are Bob, Alice [2026-03-25 23:38:09,222][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:38:09,287][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:38:09,288][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:38:09,289][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:38:10,006][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:38:10,656][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:38:11,376][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:38:12,094][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:38:12,813][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:38:13,532][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:38:14,251][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:38:14,972][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:38:15,693][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:38:16,413][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:38:17,133][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:38:17,852][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:38:18,573][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:38:19,295][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:38:20,016][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:38:20,737][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:38:21,457][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:38:22,180][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:38:22,901][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:38:23,620][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:38:24,344][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:38:25,066][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:38:25,785][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:38:26,505][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:38:27,224][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:38:27,943][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:38:28,663][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:38:29,384][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:38:30,103][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:38:30,822][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:38:31,541][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:38:32,261][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:38:32,979][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:38:33,700][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:38:34,418][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:38:35,138][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:38:35,858][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:38:36,578][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:38:37,297][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:38:38,018][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:38:38,736][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:38:39,457][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:38:40,177][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:38:40,897][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:38:41,617][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:38:42,336][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:38:43,056][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:38:43,777][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:38:44,798][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:38:45,519][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:38:46,237][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:38:46,958][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:38:47,678][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:38:48,398][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:38:49,117][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:38:49,838][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:38:50,557][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:38:51,278][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:38:51,999][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:38:52,719][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:38:53,438][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:38:54,158][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:38:54,879][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:38:55,600][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:38:56,320][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:38:57,088][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 23:38:58,173][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:38:58,177][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:38:58,178][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:39:01,981][__main__][INFO] - Iteration 579 took 58s (8.92% Gen, 84.58% Train). Generation: 5s, Training: 49s. Estimated remaining time: 7h 2m 3s. Estimated total time: 16h 16m 31s. Time estimates for 10 more iterations: 9m 45s, 100 more iterations: 1h 37m 39s, 500 more iterations: 8h 8m 15s. [2026-03-25 23:39:01,984][__main__][INFO] - Starting iteration 579. [2026-03-25 23:39:01,989][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:39:01,989][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:39:07,392][__main__][INFO] - Number of regex retries in iteration 579: 0 [2026-03-25 23:39:07,393][__main__][INFO] - agents played in iteration 579 are Bob, Alice [2026-03-25 23:39:07,943][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:39:08,008][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:39:08,009][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:39:08,010][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:39:08,692][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:39:09,340][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:39:10,057][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:39:10,773][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:39:11,490][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:39:12,206][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:39:12,923][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:39:13,639][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:39:14,356][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:39:15,073][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:39:15,791][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:39:16,509][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:39:17,226][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:39:17,944][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:39:18,663][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:39:19,380][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:39:20,098][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:39:20,816][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:39:21,535][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:39:22,251][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:39:22,970][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:39:23,687][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:39:24,404][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:39:25,124][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:39:25,840][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:39:26,559][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:39:27,276][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:39:27,995][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:39:28,714][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:39:29,430][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:39:30,154][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:39:30,872][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:39:31,589][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:39:32,308][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:39:33,024][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:39:33,742][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:39:34,459][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:39:35,177][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:39:35,895][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:39:36,611][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:39:37,331][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:39:38,049][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:39:38,769][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:39:39,488][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:39:40,205][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:39:40,923][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:39:41,640][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:39:42,359][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:39:43,340][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:39:44,059][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:39:44,777][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:39:45,501][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:39:46,223][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:39:46,947][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:39:47,668][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:39:48,393][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:39:49,115][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:39:49,839][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:39:50,558][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:39:51,279][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:39:51,996][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:39:52,717][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:39:53,436][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:39:54,154][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:39:54,875][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:39:55,597][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 23:39:56,978][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:39:56,983][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:39:56,986][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:39:58,576][__main__][INFO] - Iteration 580 took 56s (9.55% Gen, 87.64% Train). Generation: 5s, Training: 49s. Estimated remaining time: 6h 27m 44s. Estimated total time: 15h 43m 9s. Time estimates for 10 more iterations: 9m 25s, 100 more iterations: 1h 34m 18s, 500 more iterations: 7h 51m 34s. [2026-03-25 23:39:58,579][__main__][INFO] - Starting iteration 580. [2026-03-25 23:39:58,586][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:39:58,587][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:40:03,746][__main__][INFO] - Number of regex retries in iteration 580: 0 [2026-03-25 23:40:03,748][__main__][INFO] - agents played in iteration 580 are Bob, Alice [2026-03-25 23:40:04,268][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:40:04,332][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:40:04,333][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:40:04,334][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:40:05,020][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:40:05,667][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:40:06,387][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:40:07,104][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:40:07,822][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:40:08,539][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:40:09,255][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:40:09,975][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:40:10,691][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:40:11,409][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:40:12,126][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:40:12,844][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:40:13,562][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:40:14,279][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:40:14,998][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:40:15,716][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:40:16,436][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:40:17,155][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:40:17,871][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:40:18,592][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:40:19,310][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:40:20,027][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:40:20,747][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:40:21,464][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:40:22,183][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:40:22,902][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:40:23,619][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:40:24,338][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:40:25,056][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:40:25,774][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:40:26,490][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:40:27,208][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:40:27,927][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:40:28,643][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:40:29,363][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:40:30,080][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:40:30,797][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:40:31,516][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:40:32,232][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:40:32,952][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:40:33,669][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:40:34,386][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:40:35,107][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:40:35,824][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:40:36,543][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:40:37,261][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:40:37,979][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:40:38,698][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:40:39,647][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:40:40,368][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:40:41,085][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:40:41,804][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:40:42,522][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:40:43,241][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:40:43,959][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:40:44,677][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:40:45,396][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:40:46,115][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:40:46,836][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:40:47,555][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:40:48,274][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:40:48,992][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:40:49,711][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:40:50,429][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:40:51,148][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:40:51,885][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 23:40:53,042][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:40:53,046][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:40:53,047][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:40:54,405][__main__][INFO] - Iteration 581 took 55s (9.24% Gen, 88.31% Train). Generation: 5s, Training: 49s. Estimated remaining time: 6h 14m 1s. Estimated total time: 15h 30m 21s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 2s, 500 more iterations: 7h 45m 10s. [2026-03-25 23:40:54,408][__main__][INFO] - Starting iteration 581. [2026-03-25 23:40:54,412][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:40:54,413][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:41:01,849][__main__][INFO] - Number of regex retries in iteration 581: 0 [2026-03-25 23:41:01,850][__main__][INFO] - agents played in iteration 581 are Bob, Alice [2026-03-25 23:41:02,350][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:41:02,417][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:41:02,418][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:41:02,419][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:41:03,105][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:41:03,751][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:41:04,468][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:41:05,186][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:41:05,899][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:41:06,616][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:41:07,334][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:41:08,049][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:41:08,767][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:41:09,484][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:41:10,200][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:41:10,916][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:41:11,632][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:41:12,349][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:41:13,067][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:41:13,784][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:41:14,503][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:41:15,219][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:41:15,937][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:41:16,655][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:41:17,371][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:41:18,089][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:41:18,806][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:41:19,524][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:41:20,240][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:41:20,960][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:41:21,680][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:41:22,399][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:41:23,120][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:41:23,837][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:41:24,555][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:41:25,274][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:41:25,992][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:41:26,711][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:41:27,428][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:41:28,147][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:41:28,866][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:41:29,584][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:41:30,302][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:41:31,022][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:41:31,741][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:41:32,459][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:41:33,176][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:41:33,894][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:41:34,611][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:41:35,329][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:41:36,048][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:41:36,764][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:41:37,766][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:41:38,484][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:41:39,203][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:41:39,922][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:41:40,638][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:41:41,356][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:41:42,073][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:41:42,791][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:41:43,509][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:41:44,227][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:41:44,945][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:41:45,663][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:41:46,382][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:41:47,100][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:41:47,818][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:41:48,536][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:41:49,255][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:41:49,995][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 23:41:51,147][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:41:51,150][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:41:51,152][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:41:52,479][__main__][INFO] - Iteration 582 took 58s (12.81% Gen, 84.90% Train). Generation: 7s, Training: 49s. Estimated remaining time: 6h 50m 30s. Estimated total time: 16h 7m 48s. Time estimates for 10 more iterations: 9m 40s, 100 more iterations: 1h 36m 46s, 500 more iterations: 8h 3m 54s. [2026-03-25 23:41:52,481][__main__][INFO] - Starting iteration 582. [2026-03-25 23:41:52,485][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:41:52,486][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:41:57,381][__main__][INFO] - Number of regex retries in iteration 582: 0 [2026-03-25 23:41:57,382][__main__][INFO] - agents played in iteration 582 are Bob, Alice [2026-03-25 23:41:58,131][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:41:58,197][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:41:58,198][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:41:58,199][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:41:58,889][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:41:59,537][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:42:00,255][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:42:00,973][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:42:01,688][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:42:02,405][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:42:03,121][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:42:03,839][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:42:04,556][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:42:05,274][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:42:05,991][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:42:06,709][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:42:07,427][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:42:08,145][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:42:08,863][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:42:09,580][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:42:10,298][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:42:11,015][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:42:11,734][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:42:12,451][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:42:13,169][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:42:13,887][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:42:14,604][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:42:15,323][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:42:16,042][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:42:16,760][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:42:17,479][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:42:18,197][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:42:18,916][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:42:19,635][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:42:20,353][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:42:21,070][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:42:21,788][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:42:22,507][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:42:23,226][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:42:23,945][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:42:24,663][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:42:25,381][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:42:26,100][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:42:26,817][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:42:27,537][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:42:28,256][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:42:28,973][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:42:29,692][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:42:30,410][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:42:31,128][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:42:31,845][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:42:32,563][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:42:33,539][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:42:34,259][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:42:34,976][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:42:35,695][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:42:36,413][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:42:37,129][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:42:37,848][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:42:38,565][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:42:39,284][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:42:40,002][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:42:40,719][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:42:41,438][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:42:42,156][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:42:42,873][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:42:43,591][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:42:44,308][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:42:45,028][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:42:45,752][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 23:42:46,712][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:42:46,714][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:42:46,715][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:42:48,070][__main__][INFO] - Iteration 583 took 55s (8.81% Gen, 88.75% Train). Generation: 4s, Training: 49s. Estimated remaining time: 6h 8m 12s. Estimated total time: 15h 26m 26s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 38s, 500 more iterations: 7h 43m 13s. [2026-03-25 23:42:48,075][__main__][INFO] - Starting iteration 583. [2026-03-25 23:42:48,080][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:42:48,081][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:42:53,001][__main__][INFO] - Number of regex retries in iteration 583: 0 [2026-03-25 23:42:53,002][__main__][INFO] - agents played in iteration 583 are Bob, Alice [2026-03-25 23:42:53,502][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:42:53,568][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:42:53,569][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:42:53,569][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:42:54,266][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:42:54,914][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:42:55,632][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:42:56,349][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:42:57,066][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:42:57,782][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:42:58,499][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:42:59,216][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:42:59,933][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:43:00,650][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:43:01,368][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:43:02,084][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:43:02,802][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:43:03,519][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:43:04,237][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:43:04,954][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:43:05,671][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:43:06,389][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:43:07,105][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:43:07,826][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:43:08,544][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:43:09,264][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:43:09,982][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:43:10,699][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:43:11,419][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:43:12,135][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:43:12,855][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:43:13,573][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:43:14,291][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:43:15,009][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:43:15,727][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:43:16,446][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:43:17,164][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:43:17,882][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:43:18,600][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:43:19,320][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:43:20,038][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:43:20,756][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:43:21,475][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:43:22,193][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:43:22,912][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:43:23,630][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:43:24,350][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:43:25,066][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:43:25,784][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:43:26,501][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:43:27,219][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:43:27,937][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:43:28,884][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:43:29,604][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:43:30,320][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:43:31,039][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:43:31,757][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:43:32,475][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:43:33,194][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:43:33,911][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:43:34,629][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:43:35,346][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:43:36,065][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:43:36,784][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:43:37,500][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:43:38,219][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:43:38,939][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:43:39,658][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:43:40,376][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:43:41,102][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 23:43:42,065][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:43:42,067][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:43:42,068][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:43:43,427][__main__][INFO] - Iteration 584 took 55s (8.89% Gen, 88.65% Train). Generation: 4s, Training: 49s. Estimated remaining time: 6h 3m 20s. Estimated total time: 15h 22m 29s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 14s, 500 more iterations: 7h 41m 14s. [2026-03-25 23:43:43,430][__main__][INFO] - Starting iteration 584. [2026-03-25 23:43:43,434][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:43:43,435][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:43:48,394][__main__][INFO] - Number of regex retries in iteration 584: 0 [2026-03-25 23:43:48,395][__main__][INFO] - agents played in iteration 584 are Bob, Alice [2026-03-25 23:43:48,896][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:43:48,961][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:43:48,961][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:43:48,962][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:43:49,647][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:43:50,294][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:43:51,015][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:43:51,729][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:43:52,449][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:43:53,163][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:43:53,882][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:43:54,597][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:43:55,316][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:43:56,032][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:43:56,750][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:43:57,467][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:43:58,184][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:43:58,902][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:43:59,619][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:44:00,338][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:44:01,055][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:44:01,771][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:44:02,490][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:44:03,208][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:44:03,926][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:44:04,644][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:44:05,361][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:44:06,080][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:44:06,797][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:44:07,516][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:44:08,233][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:44:08,953][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:44:09,673][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:44:10,390][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:44:11,110][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:44:11,827][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:44:12,546][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:44:13,266][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:44:13,983][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:44:14,702][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:44:15,420][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:44:16,138][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:44:16,858][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:44:17,576][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:44:18,295][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:44:19,014][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:44:19,732][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:44:20,451][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:44:21,169][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:44:21,887][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:44:22,605][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:44:23,323][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:44:24,295][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:44:25,015][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:44:25,733][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:44:26,451][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:44:27,168][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:44:27,887][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:44:28,604][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:44:29,322][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:44:30,041][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:44:30,759][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:44:31,477][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:44:32,195][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:44:32,912][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:44:33,631][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:44:34,348][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:44:35,069][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:44:35,788][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:44:36,573][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 23:44:37,575][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:44:37,579][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:44:37,580][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:44:38,894][__main__][INFO] - Iteration 585 took 55s (8.94% Gen, 88.68% Train). Generation: 4s, Training: 49s. Estimated remaining time: 6h 4m 17s. Estimated total time: 15h 24m 21s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 26s, 500 more iterations: 7h 42m 10s. [2026-03-25 23:44:38,897][__main__][INFO] - Starting iteration 585. [2026-03-25 23:44:38,901][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:44:38,902][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:44:43,878][__main__][INFO] - Number of regex retries in iteration 585: 0 [2026-03-25 23:44:43,880][__main__][INFO] - agents played in iteration 585 are Bob, Alice [2026-03-25 23:44:44,409][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:44:44,476][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:44:44,477][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:44:44,477][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:44:45,187][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:44:45,833][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:44:46,555][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:44:47,270][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:44:47,989][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:44:48,704][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:44:49,422][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:44:50,139][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:44:50,856][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:44:51,574][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:44:52,292][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:44:53,011][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:44:53,728][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:44:54,447][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:44:55,166][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:44:55,883][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:44:56,601][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:44:57,318][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:44:58,038][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:44:58,757][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:44:59,474][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:45:00,193][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:45:00,912][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:45:01,630][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:45:02,350][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:45:03,066][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:45:03,786][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:45:04,503][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:45:05,220][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:45:05,941][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:45:06,660][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:45:07,378][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:45:08,097][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:45:08,817][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:45:09,536][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:45:10,256][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:45:10,975][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:45:11,693][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:45:12,412][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:45:13,131][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:45:13,850][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:45:14,568][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:45:15,287][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:45:16,006][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:45:16,726][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:45:17,444][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:45:18,161][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:45:18,881][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:45:19,884][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:45:20,603][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:45:21,320][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:45:22,038][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:45:22,757][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:45:23,474][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:45:24,193][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:45:24,910][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:45:25,628][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:45:26,347][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:45:27,065][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:45:27,783][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:45:28,501][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:45:29,219][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:45:29,937][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:45:30,656][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:45:31,374][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:45:32,104][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 23:45:33,407][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:45:33,411][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:45:33,413][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:45:34,752][__main__][INFO] - Iteration 586 took 55s (8.91% Gen, 88.68% Train). Generation: 4s, Training: 49s. Estimated remaining time: 6h 9m 53s. Estimated total time: 15h 30m 53s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 5s, 500 more iterations: 7h 45m 26s. [2026-03-25 23:45:34,756][__main__][INFO] - Starting iteration 586. [2026-03-25 23:45:34,759][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:45:34,760][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:45:39,847][__main__][INFO] - Number of regex retries in iteration 586: 0 [2026-03-25 23:45:39,848][__main__][INFO] - agents played in iteration 586 are Bob, Alice [2026-03-25 23:45:40,431][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:45:40,496][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:45:40,497][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:45:40,498][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:45:41,184][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:45:41,833][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:45:42,553][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:45:43,268][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:45:43,986][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:45:44,702][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:45:45,419][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:45:46,135][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:45:46,852][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:45:47,568][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:45:48,286][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:45:49,003][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:45:49,720][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:45:50,438][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:45:51,155][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:45:51,873][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:45:52,590][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:45:53,307][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:45:54,027][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:45:54,744][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:45:55,461][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:45:56,180][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:45:56,898][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:45:57,618][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:45:58,340][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:45:59,058][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:45:59,776][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:46:00,497][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:46:01,217][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:46:01,936][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:46:02,656][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:46:03,373][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:46:04,092][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:46:04,812][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:46:05,532][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:46:06,254][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:46:06,975][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:46:07,693][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:46:08,414][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:46:09,136][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:46:09,856][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:46:10,574][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:46:11,296][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:46:12,014][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:46:12,732][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:46:13,452][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:46:14,168][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:46:14,885][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:46:15,832][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:46:16,550][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:46:17,267][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:46:17,987][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:46:18,703][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:46:19,421][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:46:20,138][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:46:20,857][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:46:21,575][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:46:22,293][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:46:23,011][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:46:23,729][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:46:24,447][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:46:25,165][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:46:25,883][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:46:26,602][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:46:27,320][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:46:28,058][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 23:46:28,997][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:46:29,000][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:46:29,002][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:46:30,647][__main__][INFO] - Iteration 587 took 55s (9.10% Gen, 87.95% Train). Generation: 5s, Training: 49s. Estimated remaining time: 6h 9m 33s. Estimated total time: 15h 31m 29s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 8s, 500 more iterations: 7h 45m 44s. [2026-03-25 23:46:30,651][__main__][INFO] - Starting iteration 587. [2026-03-25 23:46:30,657][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:46:30,659][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:46:35,656][__main__][INFO] - Number of regex retries in iteration 587: 0 [2026-03-25 23:46:35,657][__main__][INFO] - agents played in iteration 587 are Bob, Alice [2026-03-25 23:46:36,485][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:46:36,551][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:46:36,552][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:46:36,553][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:46:37,237][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:46:37,884][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:46:38,602][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:46:39,321][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:46:40,037][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:46:40,754][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:46:41,470][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:46:42,186][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:46:42,903][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:46:43,620][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:46:44,336][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:46:45,052][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:46:45,772][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:46:46,487][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:46:47,206][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:46:47,924][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:46:48,641][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:46:49,356][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:46:50,076][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:46:50,792][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:46:51,510][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:46:52,227][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:46:52,944][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:46:53,662][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:46:54,380][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:46:55,097][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:46:55,815][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:46:56,533][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:46:57,250][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:46:57,970][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:46:58,688][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:46:59,406][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:47:00,125][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:47:00,843][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:47:01,563][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:47:02,280][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:47:02,999][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:47:03,718][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:47:04,435][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:47:05,153][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:47:05,869][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:47:06,587][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:47:07,305][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:47:08,024][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:47:08,743][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:47:09,461][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:47:10,178][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:47:10,896][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:47:11,842][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:47:12,562][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:47:13,281][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:47:13,998][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:47:14,717][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:47:15,434][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:47:16,153][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:47:16,871][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:47:17,587][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:47:18,306][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:47:19,023][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:47:19,742][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:47:20,461][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:47:21,178][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:47:21,896][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:47:22,615][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:47:23,333][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:47:24,079][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 23:47:25,101][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:47:25,104][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:47:25,105][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:47:26,481][__main__][INFO] - Iteration 588 took 55s (8.95% Gen, 88.57% Train). Generation: 4s, Training: 49s. Estimated remaining time: 6h 7m 34s. Estimated total time: 15h 30m 26s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 2s, 500 more iterations: 7h 45m 13s. [2026-03-25 23:47:26,484][__main__][INFO] - Starting iteration 588. [2026-03-25 23:47:26,489][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:47:26,490][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:47:31,506][__main__][INFO] - Number of regex retries in iteration 588: 0 [2026-03-25 23:47:31,507][__main__][INFO] - agents played in iteration 588 are Bob, Alice [2026-03-25 23:47:32,006][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:47:32,072][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:47:32,073][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:47:32,074][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:47:32,759][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:47:33,406][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:47:34,124][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:47:34,840][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:47:35,555][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:47:36,273][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:47:36,989][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:47:37,705][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:47:38,422][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:47:39,139][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:47:39,856][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:47:40,573][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:47:41,291][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:47:42,009][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:47:42,726][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:47:43,445][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:47:44,161][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:47:44,880][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:47:45,597][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:47:46,317][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:47:47,034][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:47:47,753][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:47:48,470][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:47:49,188][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:47:49,906][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:47:50,624][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:47:51,341][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:47:52,061][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:47:52,779][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:47:53,498][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:47:54,214][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:47:54,933][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:47:55,648][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:47:56,366][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:47:57,084][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:47:57,803][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:47:58,520][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:47:59,237][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:47:59,954][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:48:00,671][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:48:01,389][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:48:02,106][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:48:02,826][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:48:03,543][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:48:04,260][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:48:04,978][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:48:05,696][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:48:06,415][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:48:07,458][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:48:08,177][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:48:08,896][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:48:09,615][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:48:10,332][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:48:11,051][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:48:11,769][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:48:12,487][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:48:13,206][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:48:13,924][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:48:14,642][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:48:15,360][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:48:16,079][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:48:16,797][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:48:17,514][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:48:18,235][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:48:18,953][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:48:19,690][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 23:48:20,777][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:48:20,780][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:48:20,782][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:48:22,200][__main__][INFO] - Iteration 589 took 55s (9.01% Gen, 88.45% Train). Generation: 5s, Training: 49s. Estimated remaining time: 6h 4m 44s. Estimated total time: 15h 28m 32s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 51s, 500 more iterations: 7h 44m 16s. [2026-03-25 23:48:22,203][__main__][INFO] - Starting iteration 589. [2026-03-25 23:48:22,208][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:48:22,209][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:48:27,159][__main__][INFO] - Number of regex retries in iteration 589: 0 [2026-03-25 23:48:27,160][__main__][INFO] - agents played in iteration 589 are Bob, Alice [2026-03-25 23:48:27,671][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:48:27,737][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:48:27,738][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:48:27,739][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:48:28,417][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:48:29,064][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:48:29,784][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:48:30,500][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:48:31,217][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:48:31,934][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:48:32,649][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:48:33,367][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:48:34,083][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:48:34,801][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:48:35,518][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:48:36,236][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:48:36,952][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:48:37,670][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:48:38,387][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:48:39,106][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:48:39,825][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:48:40,542][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:48:41,262][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:48:41,978][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:48:42,697][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:48:43,416][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:48:44,134][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:48:44,853][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:48:45,570][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:48:46,288][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:48:47,006][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:48:47,724][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:48:48,443][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:48:49,163][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:48:49,882][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:48:50,598][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:48:51,316][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:48:52,032][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:48:52,751][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:48:53,467][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:48:54,186][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:48:54,903][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:48:55,622][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:48:56,339][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:48:57,058][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:48:57,776][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:48:58,494][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:48:59,213][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:48:59,930][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:49:00,649][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:49:01,368][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:49:02,086][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:49:03,044][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:49:03,764][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:49:04,480][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:49:05,199][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:49:05,917][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:49:06,635][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:49:07,353][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:49:08,071][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:49:08,790][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:49:09,509][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:49:10,227][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:49:10,946][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:49:11,664][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:49:12,383][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:49:13,102][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:49:13,821][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:49:14,540][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:49:15,277][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 23:49:16,219][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:49:16,222][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:49:16,223][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:49:17,639][__main__][INFO] - Iteration 590 took 55s (8.94% Gen, 88.51% Train). Generation: 4s, Training: 49s. Estimated remaining time: 5h 59m 8s. Estimated total time: 15h 23m 52s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 23s, 500 more iterations: 7h 41m 56s. [2026-03-25 23:49:17,643][__main__][INFO] - Starting iteration 590. [2026-03-25 23:49:17,647][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:49:17,647][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:49:22,611][__main__][INFO] - Number of regex retries in iteration 590: 0 [2026-03-25 23:49:22,613][__main__][INFO] - agents played in iteration 590 are Bob, Alice [2026-03-25 23:49:23,110][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:49:23,175][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:49:23,176][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:49:23,177][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:49:23,858][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:49:24,506][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:49:25,224][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:49:25,941][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:49:26,656][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:49:27,375][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:49:28,094][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:49:28,812][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:49:29,529][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:49:30,245][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:49:30,965][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:49:31,681][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:49:32,400][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:49:33,116][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:49:33,833][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:49:34,553][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:49:35,270][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:49:35,991][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:49:36,710][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:49:37,427][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:49:38,147][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:49:38,865][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:49:39,586][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:49:40,305][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:49:41,024][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:49:41,743][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:49:42,460][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:49:43,179][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:49:43,897][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:49:44,618][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:49:45,339][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:49:46,059][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:49:46,777][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:49:47,495][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:49:48,212][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:49:48,930][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:49:49,649][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:49:50,367][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:49:51,086][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:49:51,804][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:49:52,523][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:49:53,241][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:49:53,961][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:49:54,681][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:49:55,399][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:49:56,118][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:49:56,835][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:49:57,554][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:49:58,499][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:49:59,217][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:49:59,935][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:50:00,653][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:50:01,372][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:50:02,091][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:50:02,809][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:50:03,529][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:50:04,247][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:50:04,968][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:50:05,686][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:50:06,405][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:50:07,125][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:50:07,843][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:50:08,563][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:50:09,282][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:50:10,004][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:50:10,755][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 23:50:11,848][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:50:11,852][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:50:11,854][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:50:13,220][__main__][INFO] - Iteration 591 took 55s (8.93% Gen, 88.60% Train). Generation: 4s, Training: 49s. Estimated remaining time: 6h 0m 36s. Estimated total time: 15h 26m 15s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 37s, 500 more iterations: 7h 43m 7s. [2026-03-25 23:50:13,223][__main__][INFO] - Starting iteration 591. [2026-03-25 23:50:13,227][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:50:13,228][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:50:14,377][mllm.models.large_language_model_local][WARNING] - Response A did not match regex: (|), retry 1/1 [2026-03-25 23:50:18,416][__main__][INFO] - Number of regex retries in iteration 591: 1 [2026-03-25 23:50:18,417][__main__][INFO] - agents played in iteration 591 are Bob, Alice [2026-03-25 23:50:18,923][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:50:18,987][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:50:18,988][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:50:18,988][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2026-03-25 23:50:19,684][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:50:20,333][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:50:21,051][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:50:21,770][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:50:22,486][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:50:23,203][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:50:23,922][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:50:24,638][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:50:25,357][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:50:26,074][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:50:26,792][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 23:50:27,513][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:50:28,230][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:50:28,948][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:50:29,666][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:50:30,383][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:50:31,103][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:50:31,819][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:50:32,539][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:50:33,258][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:50:33,975][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 256 [2026-03-25 23:50:34,693][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:50:35,409][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:50:36,126][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:50:36,843][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:50:37,561][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:50:38,280][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:50:39,000][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:50:39,722][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:50:40,441][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:50:41,160][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:50:41,878][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 23:50:42,596][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:50:43,316][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:50:44,036][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:50:44,754][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:50:45,475][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:50:46,193][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:50:46,911][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:50:47,632][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:50:48,352][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 
23:50:49,070][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:50:49,791][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:50:50,510][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:50:51,229][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:50:51,949][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:50:52,669][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:50:53,390][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:50:54,434][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:50:55,154][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:50:55,871][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:50:56,591][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 23:50:57,311][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:50:58,028][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:50:58,749][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:50:59,466][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:51:00,186][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:51:00,906][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:51:01,623][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:51:02,342][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:51:03,061][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:51:03,780][mllm.training.trainer_common][INFO] - 
Processing mini-batch 244 of 256 [2026-03-25 23:51:04,499][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:51:05,217][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:51:05,938][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 23:51:06,668][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 23:51:07,617][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:51:07,620][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:51:07,621][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:51:08,899][__main__][INFO] - Iteration 592 took 55s (9.32% Gen, 88.38% Train). Generation: 5s, Training: 49s. Estimated remaining time: 6h 1m 18s. Estimated total time: 15h 27m 53s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 47s, 500 more iterations: 7h 43m 56s. [2026-03-25 23:51:08,901][__main__][INFO] - Starting iteration 592. [2026-03-25 23:51:08,905][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:51:08,906][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:51:13,954][__main__][INFO] - Number of regex retries in iteration 592: 0 [2026-03-25 23:51:13,955][__main__][INFO] - agents played in iteration 592 are Bob, Alice [2026-03-25 23:51:14,657][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:51:14,721][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:51:14,722][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:51:14,723][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:51:15,407][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:51:16,054][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:51:16,772][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:51:17,490][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:51:18,207][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:51:18,924][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:51:19,642][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:51:20,358][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:51:21,079][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:51:21,795][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:51:22,513][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:51:23,231][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:51:23,949][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:51:24,667][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:51:25,383][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:51:26,104][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:51:26,821][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:51:27,539][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:51:28,258][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:51:28,977][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:51:29,695][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:51:30,413][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:51:31,132][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:51:31,850][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:51:32,568][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:51:33,288][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:51:34,005][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:51:34,723][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:51:35,440][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:51:36,158][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:51:36,875][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:51:37,592][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:51:38,311][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:51:39,028][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:51:39,747][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:51:40,465][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:51:41,183][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:51:41,901][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:51:42,618][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:51:43,337][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:51:44,054][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:51:44,773][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:51:45,491][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:51:46,208][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:51:46,927][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:51:47,645][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:51:48,363][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:51:49,082][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:51:50,037][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:51:50,757][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:51:51,474][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:51:52,192][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:51:52,911][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:51:53,628][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:51:54,348][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:51:55,066][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:51:55,783][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:51:56,502][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:51:57,220][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:51:57,940][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:51:58,658][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:51:59,376][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:52:00,095][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:52:00,813][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:52:01,533][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:52:02,260][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 23:52:03,343][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:52:03,346][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:52:03,348][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:52:04,723][__main__][INFO] - Iteration 593 took 55s (9.05% Gen, 88.49% Train). Generation: 5s, Training: 49s. Estimated remaining time: 6h 2m 48s. Estimated total time: 15h 30m 19s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 1s, 500 more iterations: 7h 45m 9s. [2026-03-25 23:52:04,726][__main__][INFO] - Starting iteration 593. [2026-03-25 23:52:04,730][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:52:04,730][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:52:09,652][__main__][INFO] - Number of regex retries in iteration 593: 0 [2026-03-25 23:52:09,653][__main__][INFO] - agents played in iteration 593 are Bob, Alice [2026-03-25 23:52:10,260][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:52:10,326][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:52:10,328][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:52:10,328][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:52:11,020][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:52:11,668][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:52:12,386][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:52:13,104][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:52:13,821][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:52:14,538][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:52:15,256][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:52:15,974][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:52:16,691][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:52:17,410][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:52:18,126][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:52:18,846][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:52:19,563][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:52:20,281][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:52:21,000][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:52:21,717][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:52:22,436][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:52:23,155][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:52:23,873][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:52:24,593][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:52:25,311][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:52:26,029][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:52:26,746][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:52:27,463][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:52:28,184][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:52:28,902][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:52:29,621][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:52:30,338][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:52:31,057][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:52:31,777][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:52:32,496][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:52:33,214][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:52:33,933][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:52:34,652][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:52:35,371][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:52:36,091][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:52:36,808][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:52:37,529][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:52:38,248][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:52:38,966][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:52:39,685][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:52:40,407][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:52:41,125][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:52:41,844][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:52:42,565][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:52:43,283][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:52:44,003][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:52:44,724][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:52:45,680][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:52:46,405][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:52:47,128][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:52:47,851][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:52:48,574][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:52:49,295][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:52:50,016][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:52:50,735][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:52:51,453][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:52:52,172][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:52:52,891][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:52:53,610][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:52:54,328][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:52:55,048][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:52:55,766][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:52:56,488][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:52:57,209][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:52:57,961][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 23:52:58,941][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:52:58,944][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:52:58,946][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:53:00,288][__main__][INFO] - Iteration 594 took 55s (8.86% Gen, 88.72% Train). Generation: 4s, Training: 49s. Estimated remaining time: 5h 57m 33s. Estimated total time: 15h 25m 59s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 35s, 500 more iterations: 7h 42m 59s. [2026-03-25 23:53:00,290][__main__][INFO] - Starting iteration 594. [2026-03-25 23:53:00,295][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:53:00,295][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:53:06,648][__main__][INFO] - Number of regex retries in iteration 594: 0 [2026-03-25 23:53:06,649][__main__][INFO] - agents played in iteration 594 are Bob, Alice [2026-03-25 23:53:07,159][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:53:07,224][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:53:07,225][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:53:07,226][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:53:07,906][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:53:08,551][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:53:09,273][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:53:09,989][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:53:10,706][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:53:11,422][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:53:12,142][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:53:12,857][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:53:13,574][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:53:14,290][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:53:15,008][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:53:15,724][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:53:16,442][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:53:17,159][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:53:17,877][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:53:18,593][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:53:19,313][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:53:20,029][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:53:20,747][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:53:21,464][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:53:22,182][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:53:22,899][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:53:23,618][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:53:24,336][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:53:25,054][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:53:25,769][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:53:26,488][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:53:27,204][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:53:27,924][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:53:28,642][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:53:29,360][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:53:30,077][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:53:30,794][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:53:31,512][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:53:32,228][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:53:32,948][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:53:33,665][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:53:34,383][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:53:35,102][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:53:35,819][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:53:36,538][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:53:37,255][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:53:37,972][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:53:38,691][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:53:39,409][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:53:40,128][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:53:40,846][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:53:41,564][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:53:42,609][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:53:43,330][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:53:44,047][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:53:44,767][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:53:45,484][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:53:46,202][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:53:46,921][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:53:47,638][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:53:48,357][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:53:49,075][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:53:49,793][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:53:50,512][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:53:51,229][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:53:51,949][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:53:52,667][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:53:53,385][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:53:54,104][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:53:54,837][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 23:53:55,906][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:53:55,909][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:53:55,910][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:53:57,417][__main__][INFO] - Iteration 595 took 57s (11.12% Gen, 86.24% Train). Generation: 6s, Training: 49s. Estimated remaining time: 6h 22m 40s. Estimated total time: 15h 52m 3s. Time estimates for 10 more iterations: 9m 31s, 100 more iterations: 1h 35m 12s, 500 more iterations: 7h 56m 1s. [2026-03-25 23:53:57,419][__main__][INFO] - Starting iteration 595. [2026-03-25 23:53:57,424][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:53:57,424][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:54:02,535][__main__][INFO] - Number of regex retries in iteration 595: 0 [2026-03-25 23:54:02,536][__main__][INFO] - agents played in iteration 595 are Bob, Alice [2026-03-25 23:54:03,057][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:54:03,120][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:54:03,121][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:54:03,122][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:54:03,806][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:54:04,454][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:54:05,172][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:54:05,890][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:54:06,606][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:54:07,324][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:54:08,040][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:54:08,759][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:54:09,478][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:54:10,194][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:54:10,913][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:54:11,629][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:54:12,347][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:54:13,065][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:54:13,783][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:54:14,502][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:54:15,219][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:54:15,938][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:54:16,656][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:54:17,374][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:54:18,094][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:54:18,811][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:54:19,531][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:54:20,247][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:54:20,964][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:54:21,681][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:54:22,399][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:54:23,116][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:54:23,834][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:54:24,550][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:54:25,269][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:54:25,986][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:54:26,704][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:54:27,421][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:54:28,140][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:54:28,857][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:54:29,575][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:54:30,292][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:54:31,010][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:54:31,728][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:54:32,444][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:54:33,164][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:54:33,881][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:54:34,598][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:54:35,317][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:54:36,034][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:54:36,753][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:54:37,470][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:54:38,417][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:54:39,139][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:54:39,857][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:54:40,575][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:54:41,296][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:54:42,012][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:54:42,730][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:54:43,449][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:54:44,167][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:54:44,887][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:54:45,605][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:54:46,324][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:54:47,043][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:54:47,761][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:54:48,481][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:54:49,199][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:54:49,918][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:54:50,658][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 23:54:51,609][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:54:51,612][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:54:51,613][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:54:53,008][__main__][INFO] - Iteration 596 took 55s (9.20% Gen, 88.29% Train). Generation: 5s, Training: 49s. Estimated remaining time: 5h 56m 7s. Estimated total time: 15h 26m 26s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 38s, 500 more iterations: 7h 43m 13s. [2026-03-25 23:54:53,011][__main__][INFO] - Starting iteration 596. [2026-03-25 23:54:53,017][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:54:53,019][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:54:58,052][__main__][INFO] - Number of regex retries in iteration 596: 0 [2026-03-25 23:54:58,053][__main__][INFO] - agents played in iteration 596 are Bob, Alice [2026-03-25 23:54:58,557][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:54:58,624][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:54:58,625][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:54:58,625][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:54:59,310][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:54:59,958][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:55:00,677][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:55:01,394][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:55:02,112][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:55:02,829][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:55:03,547][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:55:04,264][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:55:04,982][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:55:05,699][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:55:06,417][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:55:07,133][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:55:07,850][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:55:08,567][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:55:09,285][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:55:10,001][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:55:10,717][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:55:11,434][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:55:12,150][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:55:12,866][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:55:13,583][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:55:14,300][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:55:15,017][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:55:15,734][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:55:16,451][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:55:17,168][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:55:17,886][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:55:18,605][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:55:19,321][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:55:20,039][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:55:20,757][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:55:21,474][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:55:22,193][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:55:22,910][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:55:23,629][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:55:24,346][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:55:25,065][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:55:25,783][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:55:26,500][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:55:27,220][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:55:27,937][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:55:28,656][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:55:29,374][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:55:30,092][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:55:30,811][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:55:31,528][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:55:32,248][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:55:32,966][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:55:33,921][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:55:34,639][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:55:35,358][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:55:36,077][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:55:36,795][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:55:37,513][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:55:38,233][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:55:38,952][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:55:39,671][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:55:40,390][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:55:41,108][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:55:41,829][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:55:42,546][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:55:43,266][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:55:43,985][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:55:44,703][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:55:45,423][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:55:46,182][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 23:55:47,124][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:55:47,127][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:55:47,128][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:55:48,477][__main__][INFO] - Iteration 597 took 55s (9.08% Gen, 88.48% Train). Generation: 5s, Training: 49s. Estimated remaining time: 5h 53m 8s. Estimated total time: 15h 24m 22s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 26s, 500 more iterations: 7h 42m 11s. [2026-03-25 23:55:48,480][__main__][INFO] - Starting iteration 597. [2026-03-25 23:55:48,486][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:55:48,486][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:55:53,488][__main__][INFO] - Number of regex retries in iteration 597: 0 [2026-03-25 23:55:53,489][__main__][INFO] - agents played in iteration 597 are Bob, Alice [2026-03-25 23:55:54,227][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:55:54,293][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:55:54,294][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:55:54,294][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:55:55,033][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:55:55,680][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:55:56,399][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:55:57,117][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:55:57,833][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:55:58,550][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:55:59,268][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:55:59,985][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:56:00,704][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:56:01,422][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:56:02,141][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:56:02,859][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:56:03,577][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:56:04,296][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:56:05,012][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:56:05,730][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:56:06,448][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:56:07,164][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:56:07,883][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:56:08,601][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:56:09,320][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:56:10,038][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:56:10,756][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:56:11,474][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:56:12,191][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:56:12,909][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:56:13,627][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:56:14,344][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:56:15,062][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:56:15,780][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:56:16,501][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:56:17,222][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:56:17,943][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:56:18,664][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:56:19,386][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:56:20,106][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:56:20,826][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:56:21,547][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:56:22,268][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:56:22,988][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:56:23,709][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:56:24,430][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:56:25,151][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:56:25,871][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:56:26,593][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:56:27,313][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:56:28,035][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:56:28,756][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:56:29,778][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:56:30,501][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:56:31,223][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:56:31,943][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:56:32,665][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:56:33,389][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:56:34,111][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:56:34,833][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:56:35,557][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:56:36,279][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:56:37,003][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:56:37,725][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:56:38,446][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:56:39,169][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:56:39,892][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:56:40,613][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:56:41,335][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:56:42,104][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-25 23:56:43,070][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:56:43,072][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:56:43,073][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:56:44,694][__main__][INFO] - Iteration 598 took 56s (8.90% Gen, 88.21% Train). Generation: 5s, Training: 49s. Estimated remaining time: 6h 4m 40s. Estimated total time: 15h 36m 51s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 41s, 500 more iterations: 7h 48m 25s. [2026-03-25 23:56:44,696][__main__][INFO] - Starting iteration 598. [2026-03-25 23:56:44,700][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:56:44,701][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:56:51,550][__main__][INFO] - Number of regex retries in iteration 598: 0 [2026-03-25 23:56:51,551][__main__][INFO] - agents played in iteration 598 are Bob, Alice [2026-03-25 23:56:52,063][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:56:52,127][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:56:52,128][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:56:52,129][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:56:52,814][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:56:53,463][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:56:54,181][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:56:54,896][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:56:55,614][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:56:56,327][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:56:57,046][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:56:57,761][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:56:58,479][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:56:59,196][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:56:59,914][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:57:00,630][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:57:01,348][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:57:02,064][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:57:02,783][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:57:03,500][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:57:04,218][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:57:04,936][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:57:05,653][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:57:06,370][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:57:07,088][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:57:07,805][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:57:08,525][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:57:09,242][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:57:09,960][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:57:10,678][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:57:11,396][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:57:12,115][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:57:12,832][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:57:13,550][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:57:14,267][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:57:14,983][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:57:15,700][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:57:16,417][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:57:17,135][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:57:17,852][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:57:18,570][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:57:19,287][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:57:20,006][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:57:20,722][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:57:21,440][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:57:22,158][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:57:22,876][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:57:23,594][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:57:24,311][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:57:25,028][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:57:25,746][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:57:26,463][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:57:27,409][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:57:28,129][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:57:28,845][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:57:29,564][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:57:30,282][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:57:31,000][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:57:31,718][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:57:32,436][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:57:33,156][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:57:33,873][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:57:34,591][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:57:35,309][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:57:36,028][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:57:36,747][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:57:37,464][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:57:38,182][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:57:38,904][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:57:39,636][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 23:57:40,606][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:57:40,608][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:57:40,610][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:57:41,995][__main__][INFO] - Iteration 599 took 57s (11.96% Gen, 85.62% Train). Generation: 6s, Training: 49s. Estimated remaining time: 6h 21m 48s. Estimated total time: 15h 54m 56s. Time estimates for 10 more iterations: 9m 32s, 100 more iterations: 1h 35m 29s, 500 more iterations: 7h 57m 28s. [2026-03-25 23:57:41,999][__main__][INFO] - Starting iteration 599. [2026-03-25 23:57:42,005][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:57:42,006][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:57:47,063][__main__][INFO] - Number of regex retries in iteration 599: 0 [2026-03-25 23:57:47,064][__main__][INFO] - agents played in iteration 599 are Bob, Alice [2026-03-25 23:57:47,567][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:57:47,633][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:57:47,634][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:57:47,634][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:57:48,317][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:57:48,964][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:57:49,684][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:57:50,399][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:57:51,116][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:57:51,833][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:57:52,550][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:57:53,267][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:57:53,983][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:57:54,703][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:57:55,419][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:57:56,139][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:57:56,855][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:57:57,574][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:57:58,292][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:57:59,011][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:57:59,728][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:58:00,448][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:58:01,163][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:58:01,882][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:58:02,596][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:58:03,312][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:58:04,029][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:58:04,745][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:58:05,462][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:58:06,179][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:58:06,896][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:58:07,613][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:58:08,330][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:58:09,051][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:58:09,767][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:58:10,484][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:58:11,203][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:58:11,920][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:58:12,638][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:58:13,355][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:58:14,073][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:58:14,790][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:58:15,508][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:58:16,226][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:58:16,944][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:58:17,662][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:58:18,380][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:58:19,098][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:58:19,817][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:58:20,533][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:58:21,252][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:58:21,970][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:58:22,919][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:58:23,638][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:58:24,356][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:58:25,074][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:58:25,792][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:58:26,511][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:58:27,228][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:58:27,948][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:58:28,666][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:58:29,385][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:58:30,104][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:58:30,822][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:58:31,542][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:58:32,261][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:58:32,980][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:58:33,698][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:58:34,416][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:58:35,182][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 23:58:36,165][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:58:36,168][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:58:36,169][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:58:37,503][__main__][INFO] - Iteration 600 took 55s (9.11% Gen, 88.48% Train). Generation: 5s, Training: 49s. Estimated remaining time: 5h 50m 57s. Estimated total time: 15h 25m 0s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 30s, 500 more iterations: 7h 42m 30s. [2026-03-25 23:58:37,505][__main__][INFO] - Starting iteration 600. [2026-03-25 23:58:37,509][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2026-03-25 23:58:37,510][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:58:42,446][__main__][INFO] - Number of regex retries in iteration 600: 0 [2026-03-25 23:58:42,448][__main__][INFO] - agents played in iteration 600 are Bob, Alice [2026-03-25 23:58:42,978][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:58:43,048][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:58:43,049][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:58:43,050][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:58:43,797][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:58:44,445][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:58:45,164][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:58:45,882][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:58:46,598][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:58:47,316][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:58:48,033][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:58:48,752][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:58:49,468][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:58:50,187][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:58:50,903][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:58:51,621][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:58:52,338][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:58:53,056][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:58:53,774][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:58:54,491][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:58:55,209][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:58:55,926][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:58:56,643][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:58:57,360][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:58:58,077][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:58:58,796][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:58:59,513][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:59:00,230][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:59:00,948][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:59:01,665][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:59:02,382][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 23:59:03,103][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 23:59:03,818][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 23:59:04,537][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 23:59:05,254][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 23:59:05,972][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-25 23:59:06,689][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 23:59:07,408][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 23:59:08,126][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 23:59:08,845][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 23:59:09,563][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 23:59:10,281][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 23:59:10,999][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 23:59:11,716][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 23:59:12,435][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 23:59:13,152][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 23:59:13,870][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 23:59:14,588][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 23:59:15,305][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 23:59:16,025][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 23:59:16,742][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 23:59:17,461][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 23:59:18,459][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 23:59:19,179][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 23:59:19,897][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 23:59:20,615][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 
23:59:21,332][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 23:59:22,050][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 23:59:22,769][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 23:59:23,488][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 23:59:24,206][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 23:59:24,923][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 23:59:25,641][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 23:59:26,359][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 23:59:27,078][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 23:59:27,796][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 23:59:28,515][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 23:59:29,234][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 23:59:29,952][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-25 23:59:30,679][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-25 23:59:31,646][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 23:59:31,648][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 23:59:31,650][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 23:59:34,512][__main__][INFO] - Iteration 601 took 57s (8.66% Gen, 86.31% Train). Generation: 4s, Training: 49s. Estimated remaining time: 6h 15m 4s. Estimated total time: 15h 50m 4s. Time estimates for 10 more iterations: 9m 30s, 100 more iterations: 1h 35m 0s, 500 more iterations: 7h 55m 2s. [2026-03-25 23:59:34,516][__main__][INFO] - Starting iteration 601. [2026-03-25 23:59:34,522][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-25 23:59:34,522][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 23:59:39,404][__main__][INFO] - Number of regex retries in iteration 601: 0 [2026-03-25 23:59:39,405][__main__][INFO] - agents played in iteration 601 are Bob, Alice [2026-03-25 23:59:40,010][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:59:40,075][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-25 23:59:40,076][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-25 23:59:40,077][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-25 23:59:40,758][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 23:59:41,403][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 23:59:42,120][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 23:59:42,837][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 23:59:43,552][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 23:59:44,269][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 23:59:44,985][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 23:59:45,702][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 23:59:46,418][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 23:59:47,134][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 23:59:47,850][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-25 23:59:48,568][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 23:59:49,284][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 23:59:50,001][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 23:59:50,718][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 23:59:51,435][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 23:59:52,153][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 23:59:52,871][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 23:59:53,588][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 23:59:54,305][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 23:59:55,022][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 23:59:55,743][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 23:59:56,460][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 23:59:57,179][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 23:59:57,897][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 23:59:58,616][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 23:59:59,332][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:00:00,051][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:00:00,768][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:00:01,485][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:00:02,204][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:00:02,922][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:00:03,642][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:00:04,359][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:00:05,077][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:00:05,793][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:00:06,511][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:00:07,228][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:00:07,947][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:00:08,664][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:00:09,382][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:00:10,098][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:00:10,818][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:00:11,534][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:00:12,253][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:00:12,970][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:00:13,688][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:00:14,405][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:00:15,356][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:00:16,075][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:00:16,791][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:00:17,511][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:00:18,228][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:00:18,946][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:00:19,664][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:00:20,381][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:00:21,100][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:00:21,817][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:00:22,537][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:00:23,255][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:00:23,974][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:00:24,696][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:00:25,414][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:00:26,134][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:00:26,855][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:00:27,595][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:00:28,639][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:00:28,643][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:00:28,644][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:00:30,027][__main__][INFO] - Iteration 602 took 55s (8.80% Gen, 88.71% Train). Generation: 4s, Training: 49s. Estimated remaining time: 5h 49m 11s. Estimated total time: 15h 25m 7s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 30s, 500 more iterations: 7h 42m 33s. [2026-03-26 00:00:30,030][__main__][INFO] - Starting iteration 602. [2026-03-26 00:00:30,035][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-26 00:00:30,036][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:00:35,588][__main__][INFO] - Number of regex retries in iteration 602: 0 [2026-03-26 00:00:35,589][__main__][INFO] - agents played in iteration 602 are Bob, Alice [2026-03-26 00:00:36,098][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:00:36,164][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:00:36,165][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:00:36,165][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:00:36,840][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:00:37,485][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:00:38,203][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:00:38,921][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:00:39,638][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:00:40,354][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:00:41,070][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:00:41,786][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:00:42,503][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:00:43,220][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:00:43,937][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:00:44,654][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:00:45,374][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:00:46,089][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:00:46,808][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:00:47,524][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:00:48,241][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:00:48,959][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:00:49,676][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:00:50,394][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:00:51,112][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:00:51,831][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:00:52,546][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:00:53,265][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:00:53,984][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:00:54,701][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:00:55,420][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:00:56,137][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:00:56,856][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:00:57,573][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:00:58,291][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:00:59,008][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:00:59,724][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:01:00,442][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:01:01,158][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:01:01,876][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:01:02,592][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:01:03,310][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:01:04,026][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:01:04,745][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:01:05,462][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:01:06,180][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:01:06,898][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:01:07,616][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:01:08,333][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:01:09,052][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:01:09,770][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:01:10,487][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:01:11,434][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:01:12,153][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:01:12,871][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:01:13,589][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:01:14,306][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:01:15,024][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:01:15,742][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:01:16,461][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:01:17,179][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:01:17,898][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:01:18,614][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:01:19,333][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:01:20,051][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:01:20,768][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:01:21,486][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:01:22,205][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:01:22,923][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:01:23,661][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:01:24,647][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:01:24,650][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:01:24,651][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:01:26,212][__main__][INFO] - Iteration 603 took 56s (9.89% Gen, 87.33% Train). Generation: 5s, Training: 49s. Estimated remaining time: 5h 59m 27s. Estimated total time: 15h 36m 19s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 37s, 500 more iterations: 7h 48m 9s. [2026-03-26 00:01:26,214][__main__][INFO] - Starting iteration 603. [2026-03-26 00:01:26,218][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-26 00:01:26,218][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:01:31,155][__main__][INFO] - Number of regex retries in iteration 603: 0 [2026-03-26 00:01:31,156][__main__][INFO] - agents played in iteration 603 are Bob, Alice [2026-03-26 00:01:31,661][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:01:31,726][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:01:31,727][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:01:31,727][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:01:32,423][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:01:33,070][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:01:33,789][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:01:34,505][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:01:35,223][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:01:35,939][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:01:36,658][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:01:37,374][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:01:38,091][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:01:38,809][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:01:39,527][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:01:40,245][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:01:40,961][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:01:41,680][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:01:42,398][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:01:43,115][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:01:43,832][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:01:44,550][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:01:45,268][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:01:45,985][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:01:46,705][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:01:47,421][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:01:48,140][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:01:48,857][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:01:49,574][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:01:50,293][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:01:51,011][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:01:51,730][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:01:52,448][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:01:53,165][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:01:53,885][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:01:54,601][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:01:55,320][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:01:56,035][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:01:56,753][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:01:57,470][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:01:58,187][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:01:58,904][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:01:59,621][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:02:00,340][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:02:01,059][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:02:01,776][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:02:02,494][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:02:03,211][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:02:03,928][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:02:04,645][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:02:05,364][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:02:06,080][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:02:07,081][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:02:07,798][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:02:08,517][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:02:09,234][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:02:09,953][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:02:10,672][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:02:11,392][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:02:12,113][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:02:12,831][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:02:13,551][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:02:14,271][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:02:14,989][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:02:15,711][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:02:16,431][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:02:17,148][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:02:17,871][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:02:18,592][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:02:19,330][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:02:20,519][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:02:20,523][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:02:20,524][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:02:22,245][__main__][INFO] - Iteration 604 took 56s (8.81% Gen, 88.11% Train). Generation: 4s, Training: 49s. Estimated remaining time: 5h 56m 0s. Estimated total time: 15h 33m 48s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 22s, 500 more iterations: 7h 46m 54s. [2026-03-26 00:02:22,248][__main__][INFO] - Starting iteration 604. [2026-03-26 00:02:22,253][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-26 00:02:22,254][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:02:27,248][__main__][INFO] - Number of regex retries in iteration 604: 0 [2026-03-26 00:02:27,250][__main__][INFO] - agents played in iteration 604 are Bob, Alice [2026-03-26 00:02:27,759][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:02:27,823][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:02:27,824][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:02:27,825][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:02:28,551][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:02:29,199][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:02:29,918][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:02:30,636][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:02:31,353][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:02:32,069][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:02:32,786][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:02:33,501][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:02:34,220][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:02:34,935][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:02:35,654][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:02:36,370][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:02:37,088][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:02:37,805][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:02:38,522][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:02:39,240][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:02:39,956][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:02:40,675][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:02:41,391][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:02:42,108][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:02:42,826][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:02:43,542][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:02:44,261][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:02:44,977][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:02:45,695][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:02:46,412][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:02:47,132][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:02:47,848][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:02:48,565][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:02:49,284][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:02:50,001][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:02:50,720][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:02:51,438][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:02:52,155][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:02:52,874][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:02:53,591][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:02:54,311][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:02:55,029][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:02:55,746][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:02:56,470][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:02:57,186][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:02:57,904][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:02:58,621][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:02:59,338][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:03:00,056][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:03:00,773][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:03:01,491][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:03:02,208][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:03:03,158][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:03:03,876][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:03:04,594][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:03:05,311][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:03:06,030][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:03:06,747][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:03:07,465][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:03:08,183][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:03:08,899][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:03:09,619][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:03:10,335][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:03:11,054][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:03:11,773][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:03:12,494][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:03:13,211][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:03:13,930][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:03:14,647][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:03:15,399][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:03:16,507][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:03:16,512][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:03:16,514][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:03:17,967][__main__][INFO] - Iteration 605 took 55s (8.97% Gen, 88.42% Train). Generation: 4s, Training: 49s. Estimated remaining time: 5h 49m 53s. Estimated total time: 15h 28m 36s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 51s, 500 more iterations: 7h 44m 18s. [2026-03-26 00:03:17,970][__main__][INFO] - Starting iteration 605. [2026-03-26 00:03:17,976][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-26 00:03:17,977][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:03:22,882][__main__][INFO] - Number of regex retries in iteration 605: 0 [2026-03-26 00:03:22,883][__main__][INFO] - agents played in iteration 605 are Bob, Alice [2026-03-26 00:03:23,388][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:03:23,453][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:03:23,454][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:03:23,455][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:03:24,139][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:03:24,786][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:03:25,505][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:03:26,220][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:03:26,938][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:03:27,653][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:03:28,371][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:03:29,087][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:03:29,804][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:03:30,520][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:03:31,235][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:03:31,953][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:03:32,669][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:03:33,388][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:03:34,105][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:03:34,823][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:03:35,539][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:03:36,257][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:03:36,973][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:03:37,691][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:03:38,407][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:03:39,128][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:03:39,845][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:03:40,563][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:03:41,280][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:03:41,997][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:03:42,716][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:03:43,433][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:03:44,152][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:03:44,869][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:03:45,587][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:03:46,306][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:03:47,023][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:03:47,741][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:03:48,459][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:03:49,175][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:03:49,892][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:03:50,608][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:03:51,327][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:03:52,043][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:03:52,763][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:03:53,480][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:03:54,197][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:03:54,916][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:03:55,633][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:03:56,351][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:03:57,068][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:03:57,786][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:03:58,737][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:03:59,457][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:04:00,173][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:04:00,891][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:04:01,609][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:04:02,327][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:04:03,045][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:04:03,762][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:04:04,482][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:04:05,201][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:04:05,918][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:04:06,637][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:04:07,354][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:04:08,073][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:04:08,791][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:04:09,508][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:04:10,227][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:04:10,961][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:04:11,932][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:04:11,935][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:04:11,936][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:04:13,248][__main__][INFO] - Iteration 606 took 55s (8.88% Gen, 88.74% Train). Generation: 4s, Training: 49s. Estimated remaining time: 5h 41m 37s. Estimated total time: 15h 21m 16s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 7s, 500 more iterations: 7h 40m 38s. [2026-03-26 00:04:13,251][__main__][INFO] - Starting iteration 606. [2026-03-26 00:04:13,255][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-26 00:04:13,257][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:04:18,213][__main__][INFO] - Number of regex retries in iteration 606: 0 [2026-03-26 00:04:18,214][__main__][INFO] - agents played in iteration 606 are Bob, Alice [2026-03-26 00:04:18,719][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:04:18,784][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:04:18,785][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:04:18,786][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:04:19,467][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:04:20,114][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:04:20,832][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:04:21,547][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:04:22,264][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:04:22,979][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:04:23,697][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:04:24,413][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:04:25,131][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:04:25,846][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:04:26,565][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:04:27,280][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:04:27,999][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:04:28,717][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:04:29,434][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:04:30,151][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:04:30,869][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:04:31,586][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:04:32,304][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:04:33,021][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:04:33,738][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:04:34,456][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:04:35,172][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:04:35,890][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:04:36,607][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:04:37,326][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:04:38,044][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:04:38,762][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:04:39,482][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:04:40,200][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:04:40,919][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:04:41,636][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:04:42,355][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:04:43,073][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:04:43,792][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:04:44,509][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:04:45,227][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:04:45,943][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:04:46,661][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:04:47,377][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:04:48,095][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:04:48,815][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:04:49,530][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:04:50,247][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:04:50,965][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:04:51,681][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:04:52,398][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:04:53,117][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:04:54,156][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:04:54,876][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:04:55,592][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:04:56,310][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:04:57,027][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:04:57,743][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:04:58,462][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:04:59,179][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:04:59,898][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:05:00,615][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:05:01,334][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:05:02,052][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:05:02,770][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:05:03,489][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:05:04,206][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:05:04,925][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:05:05,642][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:05:06,399][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:05:07,417][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:05:07,421][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:05:07,423][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:05:09,556][__main__][INFO] - Iteration 607 took 56s (8.81% Gen, 87.40% Train). Generation: 4s, Training: 49s. Estimated remaining time: 5h 57m 47s. Estimated total time: 15h 38m 23s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 50s, 500 more iterations: 7h 49m 11s. [2026-03-26 00:05:09,559][__main__][INFO] - Starting iteration 607. [2026-03-26 00:05:09,563][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-26 00:05:09,564][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:05:14,993][__main__][INFO] - Number of regex retries in iteration 607: 0 [2026-03-26 00:05:14,994][__main__][INFO] - agents played in iteration 607 are Bob, Alice [2026-03-26 00:05:15,498][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:05:15,565][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:05:15,566][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:05:15,566][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:05:16,255][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:05:16,903][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:05:17,619][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:05:18,336][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:05:19,051][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:05:19,767][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:05:20,482][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:05:21,199][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:05:21,913][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:05:22,631][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:05:23,346][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:05:24,062][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:05:24,777][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:05:25,495][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:05:26,210][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:05:26,927][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:05:27,645][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:05:28,362][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:05:29,079][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:05:29,798][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:05:30,513][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:05:31,232][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:05:31,948][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:05:32,666][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:05:33,382][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:05:34,100][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:05:34,817][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:05:35,535][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:05:36,252][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:05:36,970][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:05:37,687][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:05:38,404][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:05:39,122][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:05:39,842][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:05:40,563][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:05:41,281][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:05:42,002][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:05:42,722][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:05:43,441][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:05:44,161][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:05:44,883][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:05:45,602][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:05:46,322][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:05:47,040][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:05:47,758][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:05:48,476][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:05:49,195][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:05:49,914][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:05:50,871][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:05:51,591][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:05:52,308][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:05:53,026][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:05:53,743][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:05:54,459][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:05:55,177][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:05:55,893][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:05:56,611][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:05:57,329][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:05:58,047][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:05:58,765][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:05:59,482][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:06:00,200][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:06:00,917][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:06:01,635][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:06:02,353][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:06:03,085][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:06:04,082][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:06:04,085][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:06:04,087][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:06:07,291][__main__][INFO] - Iteration 608 took 57s (9.41% Gen, 85.04% Train). Generation: 5s, Training: 49s. Estimated remaining time: 6h 20m 37s. Estimated total time: 16h 2m 10s. Time estimates for 10 more iterations: 9m 37s, 100 more iterations: 1h 36m 13s, 500 more iterations: 8h 1m 5s. [2026-03-26 00:06:07,294][__main__][INFO] - Starting iteration 608. [2026-03-26 00:06:07,298][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-26 00:06:07,298][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:06:13,078][__main__][INFO] - Number of regex retries in iteration 608: 0 [2026-03-26 00:06:13,079][__main__][INFO] - agents played in iteration 608 are Bob, Alice [2026-03-26 00:06:13,671][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:06:13,737][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:06:13,737][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:06:13,738][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:06:14,419][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:06:15,064][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:06:15,781][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:06:16,495][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:06:17,210][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:06:17,925][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:06:18,638][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:06:19,357][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:06:20,076][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:06:20,794][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:06:21,511][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:06:22,227][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:06:22,943][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:06:23,659][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:06:24,377][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:06:25,092][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:06:25,809][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:06:26,525][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:06:27,242][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:06:27,957][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:06:28,674][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:06:29,390][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:06:30,108][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:06:30,823][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:06:31,542][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:06:32,258][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:06:32,975][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:06:33,691][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:06:34,410][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:06:35,126][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:06:35,843][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:06:36,560][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:06:37,276][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:06:37,993][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:06:38,710][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:06:39,429][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:06:40,147][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:06:40,863][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:06:41,581][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:06:42,300][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:06:43,018][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:06:43,735][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:06:44,452][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:06:45,169][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:06:45,887][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:06:46,603][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:06:47,322][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:06:48,039][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:06:48,991][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:06:49,710][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:06:50,428][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:06:51,145][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:06:51,863][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:06:52,581][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:06:53,297][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:06:54,016][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:06:54,733][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:06:55,451][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:06:56,167][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:06:56,887][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:06:57,604][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:06:58,323][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:06:59,040][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:06:59,756][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:07:00,474][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:07:01,202][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:07:02,746][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:07:02,751][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:07:02,753][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:07:04,114][__main__][INFO] - Iteration 609 took 56s (10.17% Gen, 87.43% Train). Generation: 5s, Training: 49s. Estimated remaining time: 6h 4m 28s. Estimated total time: 15h 46m 58s. Time estimates for 10 more iterations: 9m 28s, 100 more iterations: 1h 34m 41s, 500 more iterations: 7h 53m 29s. [2026-03-26 00:07:04,116][__main__][INFO] - Starting iteration 609. [2026-03-26 00:07:04,122][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-26 00:07:04,123][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:07:09,175][__main__][INFO] - Number of regex retries in iteration 609: 0 [2026-03-26 00:07:09,176][__main__][INFO] - agents played in iteration 609 are Bob, Alice [2026-03-26 00:07:09,720][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:07:09,792][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:07:09,793][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:07:09,793][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:07:10,473][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:07:11,119][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:07:11,834][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:07:12,551][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:07:13,266][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:07:13,983][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:07:14,698][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:07:15,414][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:07:16,129][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:07:16,845][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:07:17,561][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:07:18,279][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:07:18,993][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:07:19,712][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:07:20,427][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:07:21,146][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:07:21,861][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:07:22,578][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:07:23,294][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:07:24,010][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:07:24,726][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:07:25,443][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:07:26,159][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:07:26,876][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:07:27,594][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:07:28,311][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:07:29,028][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:07:29,745][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:07:30,463][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:07:31,180][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:07:31,897][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:07:32,614][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:07:33,332][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:07:34,051][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:07:34,767][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:07:35,485][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:07:36,202][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:07:36,919][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:07:37,637][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:07:38,354][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:07:39,072][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:07:39,789][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:07:40,508][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:07:41,226][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:07:41,941][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:07:42,658][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:07:43,375][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:07:44,091][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:07:45,125][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:07:45,843][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:07:46,559][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:07:47,277][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:07:47,993][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:07:48,712][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:07:49,428][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:07:50,144][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:07:50,862][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:07:51,580][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:07:52,298][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:07:53,014][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:07:53,732][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:07:54,452][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:07:55,170][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:07:55,890][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:07:56,606][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:07:57,344][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:07:58,327][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:07:58,329][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:07:58,331][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:07:59,676][__main__][INFO] - Iteration 610 took 55s (9.10% Gen, 88.48% Train). Generation: 5s, Training: 49s. Estimated remaining time: 5h 42m 32s. Estimated total time: 15h 25m 57s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 35s, 500 more iterations: 7h 42m 58s. [2026-03-26 00:07:59,679][__main__][INFO] - Starting iteration 610. [2026-03-26 00:07:59,683][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-26 00:07:59,683][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:08:04,714][__main__][INFO] - Number of regex retries in iteration 610: 0 [2026-03-26 00:08:04,715][__main__][INFO] - agents played in iteration 610 are Bob, Alice [2026-03-26 00:08:05,242][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:08:05,309][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:08:05,309][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:08:05,310][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:08:06,033][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:08:06,679][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:08:07,399][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:08:08,115][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:08:08,833][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:08:09,550][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:08:10,268][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:08:10,985][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:08:11,702][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:08:12,419][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:08:13,136][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:08:13,854][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:08:14,572][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:08:15,291][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:08:16,011][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:08:16,729][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:08:17,446][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:08:18,165][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:08:18,883][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:08:19,602][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:08:20,320][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:08:21,040][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:08:21,759][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:08:22,476][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:08:23,194][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:08:23,914][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:08:24,632][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:08:25,351][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:08:26,071][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:08:26,787][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:08:27,508][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:08:28,226][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:08:28,945][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:08:29,665][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:08:30,383][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:08:31,101][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:08:31,821][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:08:32,539][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:08:33,258][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:08:33,978][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:08:34,696][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:08:35,415][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:08:36,136][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:08:36,853][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:08:37,572][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:08:38,294][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:08:39,012][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:08:39,729][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:08:40,691][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:08:41,409][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:08:42,127][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:08:42,846][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:08:43,564][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:08:44,281][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:08:44,999][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:08:45,719][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:08:46,436][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:08:47,156][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:08:47,874][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:08:48,592][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:08:49,313][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:08:50,032][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:08:50,752][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:08:51,473][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:08:52,192][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:08:52,944][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:08:53,947][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:08:53,950][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:08:53,952][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:08:55,541][__main__][INFO] - Iteration 611 took 55s (9.01% Gen, 88.14% Train). Generation: 5s, Training: 49s. Estimated remaining time: 5h 46m 38s. Estimated total time: 15h 30m 59s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 5s, 500 more iterations: 7h 45m 29s. [2026-03-26 00:08:55,544][__main__][INFO] - Starting iteration 611. [2026-03-26 00:08:55,548][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-26 00:08:55,548][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:09:00,555][__main__][INFO] - Number of regex retries in iteration 611: 0 [2026-03-26 00:09:00,556][__main__][INFO] - agents played in iteration 611 are Bob, Alice [2026-03-26 00:09:01,078][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:09:01,143][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:09:01,144][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:09:01,145][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:09:01,853][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:09:02,500][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:09:03,220][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:09:03,935][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:09:04,653][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:09:05,369][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:09:06,088][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:09:06,804][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:09:07,522][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:09:08,240][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:09:08,958][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:09:09,674][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:09:10,394][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:09:11,111][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:09:11,830][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:09:12,547][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:09:13,267][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:09:13,983][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:09:14,702][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:09:15,421][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:09:16,138][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:09:16,858][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:09:17,576][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:09:18,294][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:09:19,015][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:09:19,733][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:09:20,451][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:09:21,171][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:09:21,888][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:09:22,608][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:09:23,326][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:09:24,044][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:09:24,765][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:09:25,482][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:09:26,201][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:09:26,921][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:09:27,640][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:09:28,359][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:09:29,078][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:09:29,796][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:09:30,517][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:09:31,234][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:09:31,953][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:09:32,672][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:09:33,390][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:09:34,108][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:09:34,827][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:09:35,545][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:09:36,505][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:09:37,223][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:09:37,941][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:09:38,659][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:09:39,378][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:09:40,098][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:09:40,815][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:09:41,534][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:09:42,254][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:09:42,971][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:09:43,690][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:09:44,409][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:09:45,128][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:09:45,847][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:09:46,564][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:09:47,283][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:09:48,003][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:09:48,792][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:09:49,796][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:09:49,799][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:09:49,801][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:09:51,139][__main__][INFO] - Iteration 612 took 55s (9.01% Gen, 88.58% Train). Generation: 5s, Training: 49s. Estimated remaining time: 5h 41m 16s. Estimated total time: 15h 26m 33s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 39s, 500 more iterations: 7h 43m 16s. [2026-03-26 00:09:51,142][__main__][INFO] - Starting iteration 612. [2026-03-26 00:09:51,146][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-26 00:09:51,147][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:09:56,890][__main__][INFO] - Number of regex retries in iteration 612: 0 [2026-03-26 00:09:56,893][__main__][INFO] - agents played in iteration 612 are Bob, Alice [2026-03-26 00:09:57,418][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:09:57,485][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:09:57,486][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:09:57,486][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:09:58,225][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:09:58,878][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:09:59,596][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:10:00,314][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:10:01,030][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:10:01,747][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:10:02,464][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:10:03,181][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:10:03,898][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:10:04,615][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:10:05,333][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:10:06,051][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:10:06,768][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:10:07,487][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:10:08,204][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:10:08,922][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:10:09,641][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:10:10,359][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:10:11,078][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:10:11,796][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:10:12,515][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:10:13,232][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:10:13,951][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:10:14,668][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:10:15,387][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:10:16,106][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:10:16,823][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:10:17,543][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:10:18,261][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:10:18,979][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:10:19,698][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:10:20,418][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:10:21,135][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:10:21,855][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:10:22,575][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:10:23,293][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:10:24,014][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:10:24,731][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:10:25,450][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:10:26,168][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:10:26,886][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:10:27,606][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:10:28,324][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:10:29,042][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:10:29,761][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:10:30,479][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:10:31,197][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:10:31,917][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:10:32,922][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:10:33,642][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:10:34,361][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:10:35,083][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:10:35,803][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:10:36,521][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:10:37,240][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:10:37,958][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:10:38,678][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:10:39,397][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:10:40,115][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:10:40,835][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:10:41,553][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:10:42,271][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:10:42,990][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:10:43,708][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:10:44,429][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:10:45,206][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:10:46,314][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:10:46,317][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:10:46,318][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:10:47,606][__main__][INFO] - Iteration 613 took 56s (10.18% Gen, 87.54% Train). Generation: 5s, Training: 49s. Estimated remaining time: 5h 54m 47s. Estimated total time: 15h 41m 1s. Time estimates for 10 more iterations: 9m 24s, 100 more iterations: 1h 34m 6s, 500 more iterations: 7h 50m 30s. [2026-03-26 00:10:47,608][__main__][INFO] - Starting iteration 613. [2026-03-26 00:10:47,616][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-26 00:10:47,617][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:10:52,728][__main__][INFO] - Number of regex retries in iteration 613: 0 [2026-03-26 00:10:52,729][__main__][INFO] - agents played in iteration 613 are Bob, Alice [2026-03-26 00:10:53,262][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:10:53,326][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:10:53,327][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:10:53,328][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:10:54,047][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:10:54,694][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:10:55,412][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:10:56,129][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:10:56,847][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:10:57,565][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:10:58,282][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:10:58,998][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:10:59,716][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:11:00,433][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:11:01,151][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:11:01,867][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:11:02,586][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:11:03,303][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:11:04,021][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:11:04,739][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:11:05,456][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:11:06,174][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:11:06,893][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:11:07,612][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:11:08,329][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:11:09,047][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:11:09,765][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:11:10,483][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:11:11,202][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:11:11,921][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:11:12,642][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:11:13,362][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:11:14,080][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:11:14,799][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:11:15,518][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:11:16,236][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:11:16,956][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:11:17,674][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:11:18,393][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:11:19,114][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:11:19,835][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:11:20,553][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:11:21,272][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:11:21,990][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:11:22,708][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:11:23,426][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:11:24,145][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:11:24,863][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:11:25,581][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:11:26,300][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:11:27,018][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:11:27,737][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:11:28,695][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:11:29,416][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:11:30,133][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:11:30,853][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:11:31,572][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:11:32,291][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:11:33,010][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:11:33,729][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:11:34,447][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:11:35,166][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:11:35,884][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:11:36,603][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:11:37,322][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:11:38,041][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:11:38,762][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:11:39,479][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:11:40,199][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:11:40,967][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:11:41,999][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:11:42,002][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:11:42,004][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:11:43,622][__main__][INFO] - Iteration 614 took 56s (9.13% Gen, 87.98% Train). Generation: 5s, Training: 49s. Estimated remaining time: 5h 46m 19s. Estimated total time: 15h 33m 28s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 20s, 500 more iterations: 7h 46m 44s. [2026-03-26 00:11:43,625][__main__][INFO] - Starting iteration 614. [2026-03-26 00:11:43,629][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-26 00:11:43,630][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:11:48,762][__main__][INFO] - Number of regex retries in iteration 614: 0 [2026-03-26 00:11:48,763][__main__][INFO] - agents played in iteration 614 are Bob, Alice [2026-03-26 00:11:49,316][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:11:49,383][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:11:49,383][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:11:49,384][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:11:50,125][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:11:50,773][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:11:51,492][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:11:52,211][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:11:52,928][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:11:53,646][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:11:54,363][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:11:55,079][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:11:55,796][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:11:56,514][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:11:57,232][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:11:57,950][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:11:58,672][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:11:59,390][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:12:00,109][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:12:00,828][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:12:01,548][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:12:02,265][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:12:02,987][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:12:03,706][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:12:04,425][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:12:05,144][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:12:05,862][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:12:06,580][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:12:07,299][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:12:08,017][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:12:08,736][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:12:09,456][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:12:10,174][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:12:10,892][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:12:11,611][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:12:12,330][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:12:13,050][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:12:13,768][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:12:14,489][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:12:15,208][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:12:15,927][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:12:16,647][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:12:17,367][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:12:18,085][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:12:18,804][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:12:19,520][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:12:20,240][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:12:20,957][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:12:21,677][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:12:22,393][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:12:23,113][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:12:23,833][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:12:24,792][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:12:25,512][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:12:26,230][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:12:26,948][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:12:27,667][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:12:28,385][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:12:29,104][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:12:29,822][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:12:30,541][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:12:31,260][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:12:31,979][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:12:32,696][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:12:33,415][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:12:34,133][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:12:34,852][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:12:35,570][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:12:36,291][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:12:37,053][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:12:38,116][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:12:38,119][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:12:38,121][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:12:39,434][__main__][INFO] - Iteration 615 took 55s (9.20% Gen, 88.44% Train). Generation: 5s, Training: 49s. Estimated remaining time: 5h 42m 2s. Estimated total time: 15h 30m 7s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 0s, 500 more iterations: 7h 45m 3s. [2026-03-26 00:12:39,438][__main__][INFO] - Starting iteration 615. [2026-03-26 00:12:39,443][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-26 00:12:39,444][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:12:44,519][__main__][INFO] - Number of regex retries in iteration 615: 0 [2026-03-26 00:12:44,520][__main__][INFO] - agents played in iteration 615 are Bob, Alice [2026-03-26 00:12:45,156][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:12:45,222][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:12:45,223][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:12:45,224][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:12:45,982][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:12:46,628][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:12:47,348][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:12:48,064][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:12:48,781][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:12:49,499][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:12:50,216][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:12:50,934][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:12:51,652][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:12:52,370][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:12:53,089][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:12:53,806][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:12:54,524][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:12:55,242][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:12:55,961][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:12:56,678][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:12:57,394][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:12:58,112][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:12:58,831][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:12:59,549][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:13:00,266][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:13:00,986][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:13:01,704][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:13:02,422][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:13:03,142][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:13:03,860][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:13:04,580][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:13:05,299][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:13:06,017][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:13:06,738][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:13:07,456][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:13:08,172][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:13:08,891][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:13:09,609][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:13:10,326][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:13:11,046][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:13:11,763][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:13:12,482][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:13:13,201][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:13:13,918][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:13:14,638][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:13:15,355][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:13:16,076][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:13:16,793][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:13:17,511][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:13:18,231][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:13:18,949][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:13:19,668][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:13:20,680][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:13:21,398][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:13:22,117][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:13:22,835][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:13:23,552][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:13:24,272][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:13:24,992][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:13:25,710][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:13:26,429][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:13:27,148][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:13:27,866][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:13:28,587][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:13:29,305][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:13:30,023][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:13:30,743][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:13:31,461][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:13:32,183][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:13:32,938][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:13:34,013][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:13:34,017][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:13:34,018][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:13:35,299][__main__][INFO] - Iteration 616 took 55s (9.09% Gen, 88.61% Train). Generation: 5s, Training: 49s. Estimated remaining time: 5h 41m 57s. Estimated total time: 15h 30m 58s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 5s, 500 more iterations: 7h 45m 29s. [2026-03-26 00:13:35,302][__main__][INFO] - Starting iteration 616. [2026-03-26 00:13:35,306][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-26 00:13:35,307][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:13:40,377][__main__][INFO] - Number of regex retries in iteration 616: 0 [2026-03-26 00:13:40,379][__main__][INFO] - agents played in iteration 616 are Bob, Alice [2026-03-26 00:13:40,899][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:13:40,964][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:13:40,965][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:13:40,965][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:13:41,678][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:13:42,326][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:13:43,046][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:13:43,762][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:13:44,481][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:13:45,197][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:13:45,915][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:13:46,632][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:13:47,351][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:13:48,067][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:13:48,785][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:13:49,504][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:13:50,221][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:13:50,940][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:13:51,657][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:13:52,376][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:13:53,095][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:13:53,812][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:13:54,531][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:13:55,249][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:13:55,967][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:13:56,686][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:13:57,404][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:13:58,123][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:13:58,843][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:13:59,560][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:14:00,281][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:14:00,999][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:14:01,717][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:14:02,437][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:14:06,254][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:14:06,971][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:14:07,688][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:14:08,406][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:14:09,124][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:14:09,843][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:14:10,561][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:14:11,280][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:14:11,999][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:14:12,716][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:14:13,435][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:14:14,151][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:14:14,870][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:14:15,589][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:14:16,306][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:14:17,025][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:14:17,743][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:14:18,461][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:14:19,421][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:14:20,141][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:14:20,858][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:14:21,577][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:14:22,294][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:14:23,012][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:14:23,732][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:14:24,449][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:14:25,170][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:14:25,889][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:14:26,607][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:14:27,327][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:14:28,045][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:14:28,765][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:14:29,483][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:14:30,201][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:14:30,919][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:14:31,684][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:50 [2026-03-26 00:14:32,698][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:14:32,701][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:14:32,703][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:14:34,102][__main__][INFO] - Iteration 617 took 58s (8.63% Gen, 88.99% Train). Generation: 5s, Training: 52s. Estimated remaining time: 6h 29m 58s. Estimated total time: 16h 19m 57s. Time estimates for 10 more iterations: 9m 47s, 100 more iterations: 1h 37m 59s, 500 more iterations: 8h 9m 58s. [2026-03-26 00:14:34,106][__main__][INFO] - Starting iteration 617. [2026-03-26 00:14:34,110][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-26 00:14:34,111][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:14:39,523][__main__][INFO] - Number of regex retries in iteration 617: 0 [2026-03-26 00:14:39,524][__main__][INFO] - agents played in iteration 617 are Bob, Alice [2026-03-26 00:14:40,066][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:14:40,131][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:14:40,132][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:14:40,133][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:14:40,842][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:14:41,488][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:14:42,209][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:14:42,924][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:14:43,643][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:14:44,359][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:14:45,077][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:14:45,794][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:14:46,511][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:14:47,228][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:14:47,944][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:14:48,662][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:14:49,378][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:14:50,095][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:14:50,813][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:14:51,530][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:14:52,248][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:14:52,964][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:14:53,683][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:14:54,403][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:14:55,122][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:14:55,841][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:14:56,557][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:14:57,277][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:14:57,995][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:14:58,714][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:14:59,434][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:15:00,153][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:15:00,872][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:15:01,592][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:15:02,311][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:15:03,030][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:15:03,748][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:15:04,466][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:15:05,185][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:15:05,905][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:15:06,623][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:15:07,341][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:15:08,059][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:15:08,776][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:15:09,493][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:15:10,209][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:15:10,927][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:15:11,642][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:15:12,360][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:15:13,076][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:15:13,794][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:15:14,509][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:15:15,458][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:15:16,175][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:15:16,891][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:15:17,609][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:15:18,325][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:15:19,042][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:15:19,759][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:15:20,475][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:15:21,194][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:15:21,912][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:15:22,630][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:15:23,346][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:15:24,064][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:15:24,781][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:15:25,497][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:15:26,216][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:15:26,933][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:15:27,697][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:15:28,987][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:15:29,030][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:15:29,031][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:15:30,600][__main__][INFO] - Iteration 618 took 56s (9.58% Gen, 87.63% Train). Generation: 5s, Training: 49s. Estimated remaining time: 5h 50m 36s. Estimated total time: 15h 41m 32s. Time estimates for 10 more iterations: 9m 24s, 100 more iterations: 1h 34m 9s, 500 more iterations: 7h 50m 46s. [2026-03-26 00:15:30,603][__main__][INFO] - Starting iteration 618. [2026-03-26 00:15:30,619][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-26 00:15:30,620][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:15:35,678][__main__][INFO] - Number of regex retries in iteration 618: 0 [2026-03-26 00:15:35,679][__main__][INFO] - agents played in iteration 618 are Bob, Alice [2026-03-26 00:15:36,182][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:15:36,246][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:15:36,247][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:15:36,248][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:15:36,930][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:15:37,578][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:15:38,297][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:15:39,012][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:15:39,728][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:15:40,443][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:15:41,159][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:15:41,875][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:15:42,591][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:15:43,308][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:15:44,024][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:15:44,740][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:15:45,457][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:15:46,172][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:15:46,889][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:15:47,605][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:15:48,322][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:15:49,037][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:15:49,753][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:15:50,471][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:15:51,188][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:15:51,906][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:15:52,623][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:15:53,340][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:15:54,057][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:15:54,774][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:15:55,494][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:15:56,213][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:15:56,932][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:15:57,652][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:15:58,369][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:15:59,089][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:15:59,807][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:16:00,526][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:16:01,245][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:16:01,963][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:16:02,682][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:16:03,400][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:16:04,118][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:16:04,837][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:16:05,554][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:16:06,272][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:16:06,991][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:16:07,707][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:16:08,427][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:16:09,145][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:16:09,864][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:16:10,583][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:16:11,623][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:16:12,342][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:16:13,060][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:16:13,779][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:16:14,499][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:16:15,217][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:16:15,936][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:16:16,655][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:16:17,373][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:16:18,093][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:16:18,812][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:16:19,530][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:16:20,250][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:16:20,968][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:16:21,687][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:16:22,406][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:16:23,124][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:16:23,906][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:16:24,943][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:16:24,946][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:16:24,947][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:16:26,207][__main__][INFO] - Iteration 619 took 55s (9.10% Gen, 88.63% Train). Generation: 5s, Training: 49s. Estimated remaining time: 5h 34m 38s. Estimated total time: 15h 26m 30s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 39s, 500 more iterations: 7h 43m 15s. [2026-03-26 00:16:26,210][__main__][INFO] - Starting iteration 619. [2026-03-26 00:16:26,214][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-26 00:16:26,215][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:16:31,314][__main__][INFO] - Number of regex retries in iteration 619: 0 [2026-03-26 00:16:31,315][__main__][INFO] - agents played in iteration 619 are Bob, Alice [2026-03-26 00:16:31,833][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:16:31,899][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:16:31,899][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:16:31,900][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:16:32,614][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:16:33,261][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:16:33,979][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:16:34,695][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:16:35,412][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:16:36,129][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:16:36,847][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:16:37,564][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:16:38,282][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:16:38,999][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:16:39,719][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:16:40,434][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:16:41,153][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:16:41,871][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:16:42,588][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:16:43,306][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:16:44,023][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:16:44,743][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:16:45,459][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:16:46,179][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:16:46,897][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:16:47,614][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:16:48,333][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:16:49,051][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:16:49,770][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:16:50,487][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:16:51,207][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:16:51,925][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:16:52,643][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:16:53,363][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:16:54,081][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:16:54,804][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:16:55,521][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:16:56,239][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:16:56,957][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:16:57,674][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:16:58,391][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:16:59,109][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:16:59,826][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:17:00,544][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:17:01,262][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:17:01,980][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:17:02,698][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:17:03,418][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:17:04,136][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:17:04,854][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:17:05,573][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:17:06,291][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:17:07,244][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:17:07,965][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:17:08,685][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:17:09,403][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:17:12,266][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:17:12,983][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:17:13,700][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:17:14,417][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:17:15,136][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:17:15,852][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:17:16,569][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:17:17,287][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:17:18,004][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:17:18,721][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:17:19,438][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:17:20,156][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:17:20,872][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:17:21,602][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:48 [2026-03-26 00:17:22,654][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:17:22,658][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:17:22,660][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:17:24,063][__main__][INFO] - Iteration 620 took 57s (8.82% Gen, 88.76% Train). Generation: 5s, Training: 51s. Estimated remaining time: 6h 11m 20s. Estimated total time: 16h 4m 10s. Time estimates for 10 more iterations: 9m 38s, 100 more iterations: 1h 36m 25s, 500 more iterations: 8h 2m 5s. [2026-03-26 00:17:24,067][__main__][INFO] - Starting iteration 620. [2026-03-26 00:17:24,074][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-26 00:17:24,075][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:17:28,989][__main__][INFO] - Number of regex retries in iteration 620: 0 [2026-03-26 00:17:28,990][__main__][INFO] - agents played in iteration 620 are Bob, Alice [2026-03-26 00:17:29,712][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:17:29,778][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:17:29,779][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:17:29,780][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:17:30,465][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:17:31,113][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:17:31,828][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:17:32,544][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:17:33,258][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:17:33,974][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:17:34,688][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:17:35,403][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:17:36,119][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:17:36,834][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:17:37,550][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:17:38,265][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:17:38,983][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:17:39,698][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:17:40,414][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:17:41,130][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:17:41,852][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:17:42,573][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:17:43,293][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:17:44,012][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:17:44,734][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:17:45,454][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:17:46,174][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:17:46,894][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:17:47,614][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:17:48,334][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:17:49,055][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:17:49,777][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:17:50,496][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:17:51,218][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:17:51,937][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:17:52,654][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:17:53,371][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:17:54,087][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:17:54,804][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:17:55,521][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:17:56,239][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:17:56,955][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:17:57,674][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:17:58,390][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:17:59,108][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:17:59,825][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:18:00,543][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:18:01,262][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:18:01,978][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:18:02,697][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:18:03,414][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:18:04,133][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:18:05,083][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:18:05,800][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:18:06,518][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:18:07,236][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:18:07,952][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:18:08,671][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:18:09,388][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:18:10,107][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:18:10,826][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:18:11,541][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:18:12,260][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:18:12,977][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:18:13,696][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:18:14,412][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:18:15,131][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:18:15,849][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:18:16,565][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:18:17,315][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:18:18,369][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:18:18,372][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:18:18,373][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:18:22,146][__main__][INFO] - Iteration 621 took 58s (8.46% Gen, 85.03% Train). Generation: 4s, Training: 49s. Estimated remaining time: 6h 14m 7s. Estimated total time: 16h 7m 54s. Time estimates for 10 more iterations: 9m 40s, 100 more iterations: 1h 36m 47s, 500 more iterations: 8h 3m 57s. [2026-03-26 00:18:22,148][__main__][INFO] - Starting iteration 621. [2026-03-26 00:18:22,152][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-26 00:18:22,153][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:18:27,237][__main__][INFO] - Number of regex retries in iteration 621: 0 [2026-03-26 00:18:27,242][__main__][INFO] - agents played in iteration 621 are Bob, Alice [2026-03-26 00:18:27,767][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:18:27,832][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:18:27,833][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:18:27,833][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:18:28,532][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:18:29,179][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:18:29,894][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:18:30,608][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:18:31,322][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:18:32,038][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:18:32,754][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:18:33,467][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:18:34,183][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:18:34,899][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:18:35,614][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:18:36,328][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:18:37,044][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:18:37,759][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:18:38,475][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:18:39,190][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:18:39,907][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:18:40,622][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:18:41,339][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:18:42,055][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:18:42,770][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:18:43,486][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:18:44,202][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:18:44,917][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:18:45,634][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:18:46,350][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:18:47,066][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:18:47,782][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:18:48,498][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:18:49,217][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:18:49,934][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:18:50,651][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:18:51,369][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:18:52,085][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:18:52,802][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:18:53,518][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:18:54,236][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:18:54,953][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:18:55,669][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:18:56,386][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:18:57,103][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:18:57,820][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:18:58,537][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:18:59,255][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:18:59,972][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:19:00,691][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:19:01,406][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:19:02,126][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:19:03,165][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:19:03,885][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:19:04,601][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:19:05,319][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:19:06,037][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:19:06,756][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:19:07,471][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:19:08,189][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:19:08,907][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:19:09,625][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:19:10,340][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:19:11,059][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:19:11,776][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:19:12,492][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:19:13,210][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:19:13,927][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:19:14,646][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:19:15,405][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:19:16,502][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:19:16,505][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:19:16,506][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:19:17,925][__main__][INFO] - Iteration 622 took 55s (9.12% Gen, 88.33% Train). Generation: 5s, Training: 49s. Estimated remaining time: 5h 34m 51s. Estimated total time: 15h 29m 34s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 57s, 500 more iterations: 7h 44m 47s. [2026-03-26 00:19:17,928][__main__][INFO] - Starting iteration 622. [2026-03-26 00:19:17,932][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-26 00:19:17,933][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:19:22,943][__main__][INFO] - Number of regex retries in iteration 622: 0 [2026-03-26 00:19:22,944][__main__][INFO] - agents played in iteration 622 are Bob, Alice [2026-03-26 00:19:23,558][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:19:23,624][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:19:23,625][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:19:23,626][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:19:24,348][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:19:24,996][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:19:25,715][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:19:26,431][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:19:27,149][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:19:27,866][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:19:28,583][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:19:29,302][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:19:30,018][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:19:30,735][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:19:31,452][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:19:32,169][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:19:32,887][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:19:33,603][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:19:34,322][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:19:35,039][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:19:35,757][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:19:36,474][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:19:37,192][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:19:37,911][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:19:38,628][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:19:39,346][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:19:40,064][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:19:40,782][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:19:41,501][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:19:42,218][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:19:42,938][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:19:43,655][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:19:44,373][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:19:45,093][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:19:45,810][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:19:46,528][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:19:47,246][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:19:47,963][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:19:48,683][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:19:49,402][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:19:50,121][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:19:50,842][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:19:51,560][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:19:52,279][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:19:52,998][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:19:53,716][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:19:54,436][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:19:55,155][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:19:55,874][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:19:56,594][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:19:57,311][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:19:58,029][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:19:58,995][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:19:59,715][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:20:00,433][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:20:01,151][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:20:01,870][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:20:02,589][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:20:03,307][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:20:04,024][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:20:04,744][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:20:05,462][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:20:06,180][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:20:06,900][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:20:07,618][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:20:08,337][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:20:09,055][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:20:09,773][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:20:10,493][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:20:11,241][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:20:12,289][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:20:12,292][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:20:12,294][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:20:13,713][__main__][INFO] - Iteration 623 took 55s (8.98% Gen, 88.47% Train). Generation: 5s, Training: 49s. Estimated remaining time: 5h 34m 3s. Estimated total time: 15h 29m 42s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 58s, 500 more iterations: 7h 44m 51s. [2026-03-26 00:20:13,716][__main__][INFO] - Starting iteration 623. [2026-03-26 00:20:13,720][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-26 00:20:13,721][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:20:18,789][__main__][INFO] - Number of regex retries in iteration 623: 0 [2026-03-26 00:20:18,790][__main__][INFO] - agents played in iteration 623 are Bob, Alice [2026-03-26 00:20:19,360][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:20:19,426][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:20:19,427][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:20:19,428][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:20:20,139][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:20:20,785][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:20:21,503][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:20:22,222][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:20:22,939][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:20:23,657][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:20:24,373][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:20:25,092][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:20:25,808][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:20:26,525][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:20:27,242][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:20:27,961][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:20:28,677][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:20:29,395][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:20:30,113][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:20:30,830][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:20:31,549][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:20:32,266][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:20:32,985][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:20:33,702][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:20:34,419][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:20:35,139][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:20:35,857][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:20:36,576][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:20:37,294][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:20:38,013][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:20:38,733][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:20:39,452][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:20:40,171][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:20:40,889][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:20:41,605][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:20:42,325][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:20:43,041][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:20:43,759][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:20:44,476][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:20:45,194][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:20:45,913][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:20:46,630][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:20:47,348][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:20:48,066][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:20:48,784][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:20:49,502][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:20:50,220][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:20:50,940][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:20:51,657][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:20:52,375][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:20:53,094][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:20:53,812][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:20:54,770][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:20:55,489][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:20:56,207][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:20:56,927][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:20:57,645][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:20:58,364][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:20:59,084][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:20:59,802][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:21:00,521][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:21:01,240][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:21:01,958][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:21:02,678][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:21:03,397][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:21:04,114][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:21:04,831][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:21:05,550][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:21:06,267][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:21:07,009][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:21:08,091][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:21:08,095][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:21:08,097][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:21:10,655][__main__][INFO] - Iteration 624 took 56s (8.90% Gen, 86.60% Train). Generation: 5s, Training: 49s. Estimated remaining time: 5h 52m 21s. Estimated total time: 15h 48m 57s. Time estimates for 10 more iterations: 9m 29s, 100 more iterations: 1h 34m 53s, 500 more iterations: 7h 54m 28s. [2026-03-26 00:21:10,658][__main__][INFO] - Starting iteration 624. [2026-03-26 00:21:10,662][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-26 00:21:10,663][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:21:15,859][__main__][INFO] - Number of regex retries in iteration 624: 0 [2026-03-26 00:21:15,861][__main__][INFO] - agents played in iteration 624 are Bob, Alice [2026-03-26 00:21:16,375][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:21:16,441][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:21:16,442][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:21:16,442][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:21:17,128][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:21:17,773][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:21:18,492][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:21:19,206][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:21:19,922][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:21:20,637][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:21:21,354][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:21:22,071][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:21:22,787][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:21:23,504][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:21:24,221][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:21:24,938][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:21:25,655][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:21:26,372][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:21:27,091][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:21:27,809][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:21:28,526][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:21:29,243][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:21:29,959][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:21:30,679][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:21:31,398][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:21:32,115][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:21:32,834][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:21:33,550][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:21:34,269][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:21:34,989][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:21:35,706][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:21:36,425][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:21:37,142][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:21:37,862][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:21:38,581][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:21:39,298][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:21:40,017][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:21:40,737][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:21:41,455][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:21:42,173][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:21:42,889][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:21:43,606][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:21:44,324][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:21:45,042][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:21:45,761][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:21:46,479][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:21:47,196][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:21:47,913][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:21:48,630][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:21:49,345][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:21:50,063][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:21:50,780][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:21:51,819][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:21:52,537][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:21:53,253][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:21:53,971][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:21:54,690][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:21:55,409][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:21:56,128][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:21:56,844][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:21:57,563][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:21:58,282][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:21:58,998][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:21:59,717][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:22:00,433][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:22:01,153][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:22:01,870][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:22:02,587][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:22:03,304][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:22:04,036][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:22:05,039][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:22:05,042][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:22:05,044][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:22:10,377][__main__][INFO] - Iteration 625 took 59s (8.70% Gen, 82.36% Train). Generation: 5s, Training: 49s. Estimated remaining time: 6h 37m 41s. Estimated total time: 16h 35m 17s. Time estimates for 10 more iterations: 9m 57s, 100 more iterations: 1h 39m 31s, 500 more iterations: 8h 17m 38s. [2026-03-26 00:22:10,380][__main__][INFO] - Starting iteration 625. [2026-03-26 00:22:10,385][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-26 00:22:10,386][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:22:15,290][__main__][INFO] - Number of regex retries in iteration 625: 0 [2026-03-26 00:22:15,291][__main__][INFO] - agents played in iteration 625 are Bob, Alice [2026-03-26 00:22:15,801][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:22:15,867][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:22:15,868][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:22:15,868][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:22:16,556][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:22:17,200][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:22:17,917][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:22:18,631][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:22:19,345][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:22:20,060][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:22:20,774][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:22:21,490][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:22:22,205][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:22:22,920][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:22:23,634][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:22:24,348][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:22:25,064][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:22:25,778][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:22:26,494][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:22:27,208][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:22:27,925][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:22:28,640][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:22:29,354][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:22:30,072][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:22:30,786][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:22:31,503][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:22:32,218][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:22:32,935][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:22:33,651][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:22:34,367][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:22:35,083][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:22:35,798][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:22:36,516][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:22:37,231][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:22:37,950][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:22:38,667][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:22:39,385][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:22:40,100][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:22:40,818][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:22:41,534][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:22:42,250][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:22:42,967][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:22:43,684][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:22:44,402][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:22:45,121][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:22:45,838][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:22:46,558][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:22:47,275][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:22:47,993][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:22:48,712][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:22:49,431][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:22:50,149][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:22:51,099][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:22:51,817][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:22:52,533][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:22:53,251][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:22:53,969][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:22:54,687][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:22:55,403][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:22:56,121][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:22:56,838][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:22:57,556][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:22:58,274][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:22:58,990][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:22:59,711][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:23:00,427][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:23:01,146][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:23:01,863][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:23:02,580][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:23:03,318][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:23:04,374][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:23:04,378][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:23:04,379][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:23:06,477][__main__][INFO] - Iteration 626 took 56s (8.74% Gen, 87.51% Train). Generation: 4s, Training: 49s. Estimated remaining time: 5h 36m 23s. Estimated total time: 15h 34m 55s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 29s, 500 more iterations: 7h 47m 27s. [2026-03-26 00:23:06,480][__main__][INFO] - Starting iteration 626. [2026-03-26 00:23:06,483][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-26 00:23:06,484][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:23:11,563][__main__][INFO] - Number of regex retries in iteration 626: 0 [2026-03-26 00:23:11,564][__main__][INFO] - agents played in iteration 626 are Bob, Alice [2026-03-26 00:23:12,099][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:23:12,166][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:23:12,167][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:23:12,167][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:23:12,865][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:23:13,510][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:23:14,227][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:23:14,942][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:23:15,656][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:23:16,373][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:23:17,087][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:23:17,803][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:23:18,518][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:23:19,234][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:23:19,949][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:23:20,665][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:23:21,382][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:23:22,099][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:23:22,814][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:23:23,533][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:23:24,248][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:23:24,965][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:23:25,682][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:23:26,398][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:23:27,113][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:23:27,831][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:23:28,547][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:23:29,265][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:23:29,981][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:23:30,699][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:23:31,414][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:23:32,133][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:23:32,849][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:23:33,567][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:23:34,283][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:23:35,001][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:23:35,719][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:23:36,436][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:23:37,152][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:23:37,870][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:23:38,587][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:23:39,307][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:23:40,025][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:23:40,742][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:23:41,458][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:23:42,175][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:23:42,892][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:23:43,609][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:23:44,326][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:23:45,043][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:23:45,760][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:23:46,477][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:23:47,429][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:23:48,146][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:23:48,865][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:23:49,585][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:23:50,304][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:23:51,021][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:23:51,739][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:23:52,458][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:23:53,177][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:23:53,896][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:23:54,614][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:23:55,333][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:23:56,053][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:23:56,769][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:23:57,489][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:23:58,206][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:23:58,924][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:23:59,673][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:24:00,616][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:24:00,618][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:24:00,620][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:24:02,120][__main__][INFO] - Iteration 627 took 55s (9.13% Gen, 88.17% Train). Generation: 5s, Training: 49s. Estimated remaining time: 5h 27m 50s. Estimated total time: 15h 27m 18s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 43s, 500 more iterations: 7h 43m 39s. [2026-03-26 00:24:02,124][__main__][INFO] - Starting iteration 627. [2026-03-26 00:24:02,132][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-26 00:24:02,133][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:24:07,249][__main__][INFO] - Number of regex retries in iteration 627: 0 [2026-03-26 00:24:07,250][__main__][INFO] - agents played in iteration 627 are Bob, Alice [2026-03-26 00:24:07,751][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:24:07,815][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:24:07,816][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:24:07,817][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:24:08,505][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:24:09,150][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:24:09,869][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:24:10,583][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:24:11,301][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:24:12,017][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:24:12,736][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:24:13,453][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:24:14,170][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:24:14,886][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:24:15,604][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:24:16,319][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:24:17,037][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:24:17,753][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:24:18,470][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:24:19,187][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:24:19,903][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:24:20,619][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:24:21,338][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:24:22,055][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:24:22,771][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:24:23,490][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:24:24,207][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:24:24,923][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:24:25,641][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:24:26,358][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:24:27,079][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:24:27,794][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:24:28,515][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:24:29,233][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:24:29,949][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:24:30,669][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:24:31,384][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:24:32,103][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:24:32,818][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:24:33,536][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:24:34,252][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:24:34,969][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:24:35,685][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:24:36,404][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:24:37,121][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:24:37,837][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:24:38,553][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:24:39,272][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:24:39,992][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:24:40,708][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:24:41,427][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:24:42,143][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:24:43,175][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:24:43,894][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:24:44,611][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:24:45,329][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:24:46,046][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:24:46,764][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:24:47,481][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:24:48,198][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:24:48,915][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:24:49,633][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:24:50,349][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:24:51,068][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:24:51,786][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:24:52,505][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:24:53,223][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:24:53,941][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:24:54,658][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:24:55,407][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:24:56,411][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:24:56,414][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:24:56,415][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:24:57,933][__main__][INFO] - Iteration 628 took 55s (9.17% Gen, 88.10% Train). Generation: 5s, Training: 49s. Estimated remaining time: 5h 29m 42s. Estimated total time: 15h 30m 5s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 0s, 500 more iterations: 7h 45m 2s. [2026-03-26 00:24:57,936][__main__][INFO] - Starting iteration 628. [2026-03-26 00:24:57,963][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-26 00:24:57,964][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:25:03,025][__main__][INFO] - Number of regex retries in iteration 628: 0 [2026-03-26 00:25:03,026][__main__][INFO] - agents played in iteration 628 are Bob, Alice [2026-03-26 00:25:03,807][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:25:03,876][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:25:03,877][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:25:03,877][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:25:04,580][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:25:05,226][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:25:05,945][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:25:06,659][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:25:07,377][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:25:08,093][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:25:08,811][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:25:09,527][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:25:10,245][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:25:10,961][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:25:11,677][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:25:12,394][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:25:13,110][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:25:13,827][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:25:14,543][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:25:15,260][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:25:15,976][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:25:16,695][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:25:17,410][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:25:18,130][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:25:18,846][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:25:19,565][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:25:20,282][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:25:21,000][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:25:21,718][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:25:22,434][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:25:23,153][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:25:23,869][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:25:24,587][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:25:25,304][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:25:26,022][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:25:26,739][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:25:27,455][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:25:28,171][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:25:28,888][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:25:29,605][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:25:30,321][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:25:31,037][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:25:31,753][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:25:32,471][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:25:33,188][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:25:33,904][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:25:34,623][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:25:35,339][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:25:36,057][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:25:36,773][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:25:37,492][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:25:38,209][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:25:39,168][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:25:39,889][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:25:40,607][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:25:41,325][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:25:42,043][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:25:42,760][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:25:43,480][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:25:44,200][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:25:44,919][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:25:45,637][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:25:46,356][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:25:47,075][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:25:47,791][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:25:48,510][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:25:49,226][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:25:49,946][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:25:50,663][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:25:51,393][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:25:52,537][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:25:52,541][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:25:52,543][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:25:56,342][__main__][INFO] - Iteration 629 took 58s (8.67% Gen, 84.82% Train). Generation: 5s, Training: 49s. Estimated remaining time: 6h 11m 38s. Estimated total time: 16h 13m 0s. Time estimates for 10 more iterations: 9m 43s, 100 more iterations: 1h 37m 18s, 500 more iterations: 8h 6m 30s. [2026-03-26 00:25:56,345][__main__][INFO] - Starting iteration 629. [2026-03-26 00:25:56,348][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-26 00:25:56,349][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:26:01,406][__main__][INFO] - Number of regex retries in iteration 629: 0 [2026-03-26 00:26:01,408][__main__][INFO] - agents played in iteration 629 are Bob, Alice [2026-03-26 00:26:01,947][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:26:02,011][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:26:02,012][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:26:02,012][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:26:02,696][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:26:03,343][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:26:04,059][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:26:04,774][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:26:05,490][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:26:06,204][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:26:06,922][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:26:07,636][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:26:08,354][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:26:09,071][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:26:09,788][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:26:10,504][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:26:11,221][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:26:11,936][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:26:12,650][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:26:13,369][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:26:14,084][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:26:14,802][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:26:15,517][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:26:16,235][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:26:16,951][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:26:17,669][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:26:18,383][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:26:19,101][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:26:19,817][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:26:20,534][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:26:21,251][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:26:21,966][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:26:22,685][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:26:23,403][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:26:24,120][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:26:24,836][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:26:25,554][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:26:26,271][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:26:26,989][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:26:27,705][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:26:28,421][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:26:29,138][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:26:29,854][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:26:30,571][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:26:31,287][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:26:32,007][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:26:32,724][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:26:33,441][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:26:34,157][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:26:34,874][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:26:35,590][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:26:36,307][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:26:37,256][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:26:37,974][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:26:38,692][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:26:39,411][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:26:40,128][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:26:40,844][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:26:41,562][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:26:42,278][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:26:42,996][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:26:43,713][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:26:44,431][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:26:45,148][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:26:45,866][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:26:46,582][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:26:47,300][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:26:48,877][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:26:49,595][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:26:50,338][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-26 00:26:51,383][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:26:51,540][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:26:53,829][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:26:55,698][__main__][INFO] - Iteration 630 took 59s (8.52% Gen, 88.33% Train). Generation: 5s, Training: 52s. Estimated remaining time: 6h 26m 49s. Estimated total time: 16h 29m 10s. Time estimates for 10 more iterations: 9m 53s, 100 more iterations: 1h 38m 55s, 500 more iterations: 8h 14m 35s. [2026-03-26 00:26:57,955][__main__][INFO] - Starting iteration 630. [2026-03-26 00:26:57,962][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-26 00:26:57,963][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:27:03,159][__main__][INFO] - Number of regex retries in iteration 630: 0 [2026-03-26 00:27:03,161][__main__][INFO] - agents played in iteration 630 are Bob, Alice [2026-03-26 00:27:03,740][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:27:03,805][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:27:03,806][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:27:03,806][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:27:04,493][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:27:05,136][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:27:05,852][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:27:06,565][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:27:07,279][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:27:07,995][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:27:08,710][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:27:09,424][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:27:10,141][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:27:10,854][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:27:11,567][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:27:12,281][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:27:12,995][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:27:13,712][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:27:14,427][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:27:15,143][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:27:15,858][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:27:16,572][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:27:17,288][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:27:18,002][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:27:18,717][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:27:19,432][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:27:20,149][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:27:20,864][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:27:21,582][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:27:22,299][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:27:23,017][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:27:23,731][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:27:24,448][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:27:25,162][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:27:25,880][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:27:26,595][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:27:27,312][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:27:28,028][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:27:28,744][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:27:29,460][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:27:30,178][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:27:30,893][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:27:31,609][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:27:32,326][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:27:33,043][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:27:33,759][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:27:34,476][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:27:35,192][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:27:35,908][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:27:36,626][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:27:37,342][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:27:38,060][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:27:39,098][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:27:39,818][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:27:40,533][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:27:41,251][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:27:41,968][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:27:42,686][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:27:43,402][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:27:44,118][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:27:44,834][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:27:45,550][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:27:46,267][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:27:46,982][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:27:47,699][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:27:48,415][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:27:49,133][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:27:49,849][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:27:50,567][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:27:51,316][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:27:52,417][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:27:52,422][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:27:52,424][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:27:55,332][__main__][INFO] - Iteration 631 took 57s (9.06% Gen, 85.86% Train). Generation: 5s, Training: 49s. Estimated remaining time: 5h 52m 52s. Estimated total time: 15h 56m 13s. Time estimates for 10 more iterations: 9m 33s, 100 more iterations: 1h 35m 37s, 500 more iterations: 7h 58m 6s. [2026-03-26 00:27:55,336][__main__][INFO] - Starting iteration 631. [2026-03-26 00:27:55,340][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-26 00:27:55,341][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:28:00,971][__main__][INFO] - Number of regex retries in iteration 631: 0 [2026-03-26 00:28:00,972][__main__][INFO] - agents played in iteration 631 are Bob, Alice [2026-03-26 00:28:01,518][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:28:01,583][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:28:01,584][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:28:01,584][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:28:02,276][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:28:02,921][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:28:03,637][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:28:04,351][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:28:05,065][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:28:05,780][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:28:06,494][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:28:07,208][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:28:07,924][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:28:08,639][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:28:09,354][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:28:10,071][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:28:10,784][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:28:11,501][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:28:12,215][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:28:12,930][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:28:13,646][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:28:14,361][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:28:15,077][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:28:15,793][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:28:16,507][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:28:17,225][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:28:17,939][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:28:18,655][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:28:19,371][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:28:20,088][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:28:20,802][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:28:21,519][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:28:22,236][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:28:22,952][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:28:23,668][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:28:24,386][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:28:25,102][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:28:25,819][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:28:26,535][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:28:27,251][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:28:27,969][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:28:28,683][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:28:29,402][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:28:30,118][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:28:30,837][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:28:31,552][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:28:32,270][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:28:32,984][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:28:33,701][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:28:34,417][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:28:35,134][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:28:35,850][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:28:36,798][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:28:37,516][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:28:38,231][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:28:38,949][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:28:39,666][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:28:40,381][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:28:41,099][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:28:41,815][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:28:42,532][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:28:43,248][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:28:43,965][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:28:44,680][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:28:45,398][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:28:46,114][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:28:46,830][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:28:47,547][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:28:48,263][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:28:48,984][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:28:49,963][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:28:49,965][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:28:49,966][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:28:51,842][__main__][INFO] - Iteration 632 took 56s (9.96% Gen, 86.71% Train). Generation: 5s, Training: 48s. Estimated remaining time: 5h 37m 26s. Estimated total time: 15h 41m 44s. Time estimates for 10 more iterations: 9m 25s, 100 more iterations: 1h 34m 10s, 500 more iterations: 7h 50m 52s. [2026-03-26 00:28:51,844][__main__][INFO] - Starting iteration 632. [2026-03-26 00:28:51,848][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-26 00:28:51,849][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:28:58,765][__main__][INFO] - Number of regex retries in iteration 632: 0 [2026-03-26 00:28:58,766][__main__][INFO] - agents played in iteration 632 are Bob, Alice [2026-03-26 00:28:59,266][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:28:59,331][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:28:59,332][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:28:59,332][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:29:00,022][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:29:00,666][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:29:01,383][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:29:02,097][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:29:02,811][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:29:03,526][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:29:04,240][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:29:04,955][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:29:05,670][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:29:06,384][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:29:07,100][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:29:07,815][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:29:08,531][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:29:09,247][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:29:09,962][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:29:10,679][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:29:11,395][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:29:12,112][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:29:12,826][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:29:13,544][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:29:14,259][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:29:14,975][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:29:15,691][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:29:16,407][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:29:17,123][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:29:17,839][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:29:18,555][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:29:19,273][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:29:19,988][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:29:20,708][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:29:21,423][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:29:22,140][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:29:22,856][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:29:23,572][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:29:24,289][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:29:25,005][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:29:25,722][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:29:26,436][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:29:27,152][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:29:27,869][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:29:28,584][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:29:29,301][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:29:30,016][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:29:30,733][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:29:31,449][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:29:32,166][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:29:32,882][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:29:33,598][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:29:34,550][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:29:35,269][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:29:35,984][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:29:36,702][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:29:37,419][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:29:38,135][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:29:38,851][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:29:39,569][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:29:40,287][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:29:41,004][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:29:41,723][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:29:42,439][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:29:43,158][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:29:43,875][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:29:44,592][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:29:45,309][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:29:46,026][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:29:46,763][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:29:47,953][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:29:47,957][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:29:47,958][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:29:49,418][__main__][INFO] - Iteration 633 took 57s (12.01% Gen, 85.44% Train). Generation: 6s, Training: 49s. Estimated remaining time: 5h 54m 16s. Estimated total time: 15h 59m 31s. Time estimates for 10 more iterations: 9m 35s, 100 more iterations: 1h 35m 57s, 500 more iterations: 7h 59m 45s. [2026-03-26 00:29:49,421][__main__][INFO] - Starting iteration 633. [2026-03-26 00:29:49,425][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-26 00:29:49,425][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:29:54,355][__main__][INFO] - Number of regex retries in iteration 633: 0 [2026-03-26 00:29:54,356][__main__][INFO] - agents played in iteration 633 are Bob, Alice [2026-03-26 00:29:54,856][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:29:54,922][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:29:54,923][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:29:54,924][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:29:55,616][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:29:56,263][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:29:56,983][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:29:57,699][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:29:58,417][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:29:59,133][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:29:59,849][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:30:00,566][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:30:01,283][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:30:01,999][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:30:02,715][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:30:03,432][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:30:04,147][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:30:04,864][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:30:05,580][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:30:06,296][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:30:07,014][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:30:07,730][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:30:08,446][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:30:09,164][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:30:09,882][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:30:10,597][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:30:11,313][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:30:12,045][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:30:12,749][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:30:13,465][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:30:14,181][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:30:14,899][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:30:15,615][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:30:16,333][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:30:17,050][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:30:17,767][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:30:18,484][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:30:19,202][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:30:19,918][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:30:20,636][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:30:21,352][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:30:22,070][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:30:22,785][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:30:23,503][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:30:24,219][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:30:24,935][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:30:25,654][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:30:26,369][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:30:27,088][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:30:27,804][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:30:28,521][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:30:29,238][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:30:30,278][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:30:30,996][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:30:31,713][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:30:32,429][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:30:33,147][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:30:33,862][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:30:34,579][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:30:35,295][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:30:36,012][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:30:36,728][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:30:37,444][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:30:38,162][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:30:38,880][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:30:39,597][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:30:40,316][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:30:41,033][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:30:41,750][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:30:42,474][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:30:43,403][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:30:43,407][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:30:43,408][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:30:45,346][__main__][INFO] - Iteration 634 took 55s (8.82% Gen, 87.71% Train). Generation: 4s, Training: 49s. Estimated remaining time: 5h 25m 52s. Estimated total time: 15h 32m 3s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 12s, 500 more iterations: 7h 46m 1s. [2026-03-26 00:30:45,349][__main__][INFO] - Starting iteration 634. [2026-03-26 00:30:45,353][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-26 00:30:45,354][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:30:50,271][__main__][INFO] - Number of regex retries in iteration 634: 0 [2026-03-26 00:30:50,273][__main__][INFO] - agents played in iteration 634 are Bob, Alice [2026-03-26 00:30:50,777][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:30:50,841][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:30:50,842][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:30:50,843][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:30:51,533][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:30:52,179][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:30:52,896][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:30:53,613][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:30:54,326][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:30:55,044][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:30:55,759][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:30:56,476][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:30:57,190][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:30:57,908][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:30:58,624][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:30:59,340][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:31:00,056][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:31:00,775][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:31:01,491][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:31:02,207][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:31:02,923][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:31:03,639][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:31:04,355][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:31:05,070][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:31:05,786][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:31:06,501][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:31:07,218][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:31:07,933][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:31:08,649][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:31:09,366][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:31:10,083][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:31:10,798][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:31:11,514][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:31:12,231][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:31:12,949][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:31:13,665][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:31:14,384][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:31:15,103][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:31:15,819][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:31:16,539][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:31:17,257][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:31:17,976][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:31:18,694][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:31:19,413][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:31:20,132][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:31:20,851][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:31:21,571][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:31:22,290][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:31:23,008][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:31:23,728][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:31:24,446][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:31:25,166][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:31:26,125][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:31:26,843][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:31:27,562][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:31:28,280][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:31:28,999][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:31:29,716][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:31:30,435][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:31:31,153][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:31:31,872][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:31:32,592][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:31:33,311][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:31:34,030][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:31:34,749][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:31:35,467][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:31:36,185][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:31:36,905][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:31:37,624][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:31:38,375][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:31:39,372][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:31:39,375][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:31:39,376][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:31:45,816][__main__][INFO] - Iteration 635 took 1m 0s (8.14% Gen, 81.21% Train). Generation: 4s, Training: 49s. Estimated remaining time: 6h 40m 33s. Estimated total time: 16h 47m 44s. Time estimates for 10 more iterations: 10m 4s, 100 more iterations: 1h 40m 46s, 500 more iterations: 8h 23m 52s. [2026-03-26 00:31:45,819][__main__][INFO] - Starting iteration 635. [2026-03-26 00:31:45,823][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-26 00:31:45,824][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:31:55,124][__main__][INFO] - Number of regex retries in iteration 635: 0 [2026-03-26 00:31:55,125][__main__][INFO] - agents played in iteration 635 are Bob, Alice [2026-03-26 00:31:55,639][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:31:55,704][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:31:55,704][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:31:55,705][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:31:56,411][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:31:57,055][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:31:57,769][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:31:58,482][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:31:59,195][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:31:59,908][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:32:00,623][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:32:01,336][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:32:02,050][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:32:02,764][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:32:03,480][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:32:04,196][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:32:04,909][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:32:05,627][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:32:06,340][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:32:07,056][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:32:07,771][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:32:08,485][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:32:09,202][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:32:09,916][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:32:10,635][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:32:11,348][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:32:12,066][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:32:12,780][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:32:13,495][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:32:14,211][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:32:14,928][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:32:15,643][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:32:16,361][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:32:17,077][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:32:17,796][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:32:18,514][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:32:19,231][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:32:19,950][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:32:20,667][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:32:21,387][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:32:22,105][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:32:22,822][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:32:23,539][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:32:24,255][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:32:24,972][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:32:25,691][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:32:26,410][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:32:27,128][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:32:27,847][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:32:28,566][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:32:29,285][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:32:30,003][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:32:30,976][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:32:31,695][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:32:32,413][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:32:33,127][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:32:33,846][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:32:34,560][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:32:35,275][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:32:35,992][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:32:36,706][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:32:37,424][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:32:38,140][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:32:38,857][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:32:39,574][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:32:40,289][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:32:41,005][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:32:41,721][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:32:42,438][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:32:43,206][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:32:44,200][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:32:44,203][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:32:44,204][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:32:45,908][__main__][INFO] - Iteration 636 took 1m 0s (15.48% Gen, 81.68% Train). Generation: 9s, Training: 49s. Estimated remaining time: 6h 33m 15s. Estimated total time: 16h 41m 27s. Time estimates for 10 more iterations: 10m 0s, 100 more iterations: 1h 40m 8s, 500 more iterations: 8h 20m 43s. [2026-03-26 00:32:45,912][__main__][INFO] - Starting iteration 636. [2026-03-26 00:32:45,916][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-26 00:32:45,917][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:32:50,817][__main__][INFO] - Number of regex retries in iteration 636: 0 [2026-03-26 00:32:50,818][__main__][INFO] - agents played in iteration 636 are Bob, Alice [2026-03-26 00:32:51,364][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:32:51,430][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:32:51,432][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:32:51,432][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:32:52,136][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:32:52,780][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:32:53,499][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:32:54,213][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:32:54,927][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:32:55,641][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:32:56,356][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:32:57,071][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:32:57,784][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:32:58,501][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:32:59,215][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:32:59,931][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:33:00,651][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:33:01,368][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:33:02,086][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:33:02,802][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:33:03,520][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:33:04,238][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:33:04,967][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:33:05,784][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:33:06,501][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:33:07,218][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:33:07,935][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:33:08,655][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:33:09,374][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:33:10,091][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:33:10,807][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:33:11,523][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:33:12,239][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:33:12,954][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:33:13,670][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:33:14,384][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:33:15,101][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:33:15,815][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:33:16,533][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:33:17,249][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:33:17,965][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:33:18,681][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:33:19,397][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:33:20,113][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:33:20,830][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:33:21,545][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:33:22,263][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:33:22,978][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:33:23,697][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:33:24,412][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:33:25,130][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:33:25,845][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:33:26,842][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:33:27,560][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:33:28,275][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:33:28,991][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:33:29,709][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:33:30,426][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:33:31,143][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:33:31,860][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:33:32,577][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:33:33,294][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:33:34,009][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:33:34,727][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:33:35,443][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:33:36,159][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:33:36,877][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:33:37,593][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:33:38,311][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:33:39,068][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:33:40,094][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:33:40,097][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:33:40,098][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:33:42,203][__main__][INFO] - Iteration 637 took 56s (8.71% Gen, 87.55% Train). Generation: 4s, Training: 49s. Estimated remaining time: 5h 29m 1s. Estimated total time: 15h 38m 9s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 48s, 500 more iterations: 7h 49m 4s. [2026-03-26 00:33:42,208][__main__][INFO] - Starting iteration 637. [2026-03-26 00:33:42,219][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-26 00:33:42,220][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:33:47,407][__main__][INFO] - Number of regex retries in iteration 637: 0 [2026-03-26 00:33:47,408][__main__][INFO] - agents played in iteration 637 are Bob, Alice [2026-03-26 00:33:48,337][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:33:48,402][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:33:48,403][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:33:48,403][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:33:49,094][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:33:49,739][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:33:50,456][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:33:51,172][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:33:51,887][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:33:52,601][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:33:53,317][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:33:54,033][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:33:54,748][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:33:55,463][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:33:56,178][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:33:56,893][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:33:57,608][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:33:58,324][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:33:59,042][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:33:59,758][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:34:00,473][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:34:01,189][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:34:01,906][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:34:02,622][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:34:03,339][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:34:04,054][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:34:04,771][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:34:05,486][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:34:06,203][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:34:06,919][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:34:07,638][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:34:08,354][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:34:09,069][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:34:09,784][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:34:10,501][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:34:11,217][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:34:11,934][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:34:12,648][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:34:13,365][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:34:14,080][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:34:14,798][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:34:15,512][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:34:16,230][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:34:16,946][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:34:17,663][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:34:18,378][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:34:19,094][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:34:19,810][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:34:20,528][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:34:21,244][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:34:21,962][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:34:22,678][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:34:23,625][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:34:24,343][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:34:25,059][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:34:25,776][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:34:26,495][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:34:27,211][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:34:27,929][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:34:28,646][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:34:29,364][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:34:30,080][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:34:30,799][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:34:31,515][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:34:32,232][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:34:32,949][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:34:33,669][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:34:34,388][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:34:35,104][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:34:35,832][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:34:36,859][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:34:36,863][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:34:36,865][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:34:38,500][__main__][INFO] - Iteration 638 took 56s (9.22% Gen, 87.87% Train). Generation: 5s, Training: 49s. Estimated remaining time: 5h 27m 59s. Estimated total time: 15h 38m 3s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 48s, 500 more iterations: 7h 49m 1s. [2026-03-26 00:34:38,502][__main__][INFO] - Starting iteration 638. [2026-03-26 00:34:38,506][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-26 00:34:38,506][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:34:43,417][__main__][INFO] - Number of regex retries in iteration 638: 0 [2026-03-26 00:34:43,418][__main__][INFO] - agents played in iteration 638 are Bob, Alice [2026-03-26 00:34:43,973][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:34:44,040][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:34:44,043][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:34:44,043][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:34:44,737][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:34:45,383][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:34:46,101][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:34:46,816][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:34:47,533][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:34:48,246][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:34:48,962][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:34:49,677][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:34:50,393][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:34:51,109][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:34:51,825][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:34:52,544][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:34:53,259][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:34:53,976][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:34:54,694][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:34:55,409][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:34:56,128][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:34:56,844][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:34:57,561][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:34:58,281][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:34:58,998][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:34:59,716][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:35:00,434][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:35:01,152][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:35:01,869][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:35:02,587][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:35:03,305][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:35:04,021][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:35:04,739][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:35:05,458][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:35:06,176][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:35:06,893][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:35:07,610][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:35:08,326][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:35:09,046][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:35:09,762][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:35:10,482][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:35:11,201][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:35:11,923][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:35:12,640][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:35:13,359][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:35:14,078][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:35:14,795][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:35:15,514][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:35:16,233][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:35:16,950][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:35:17,668][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:35:18,384][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:35:19,345][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:35:20,062][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:35:20,778][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:35:21,495][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:35:22,211][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:35:22,928][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:35:23,644][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:35:24,362][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:35:25,077][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:35:25,796][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:35:26,512][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:35:27,228][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:35:27,946][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:35:28,662][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:35:29,380][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:35:30,097][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:35:30,816][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:35:31,579][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:35:32,633][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:35:32,637][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:35:32,638][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:35:34,687][__main__][INFO] - Iteration 639 took 56s (8.74% Gen, 87.61% Train). Generation: 4s, Training: 49s. Estimated remaining time: 5h 25m 22s. Estimated total time: 15h 36m 22s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 38s, 500 more iterations: 7h 48m 11s. [2026-03-26 00:35:34,690][__main__][INFO] - Starting iteration 639. [2026-03-26 00:35:34,695][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-26 00:35:34,695][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:35:39,760][__main__][INFO] - Number of regex retries in iteration 639: 0 [2026-03-26 00:35:39,762][__main__][INFO] - agents played in iteration 639 are Bob, Alice [2026-03-26 00:35:40,267][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:35:40,331][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:35:40,332][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:35:40,333][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:35:41,079][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:35:41,726][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:35:42,442][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:35:43,160][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:35:43,875][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:35:44,592][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:35:45,310][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:35:46,027][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:35:46,742][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:35:47,462][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:35:48,179][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:35:48,897][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:35:49,615][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:35:50,330][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:35:51,046][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:35:51,764][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:35:52,479][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:35:53,196][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:35:53,911][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:35:54,631][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:35:55,345][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:35:56,064][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:35:56,781][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:35:57,499][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:35:58,216][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:35:58,934][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:35:59,650][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:36:00,365][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:36:01,082][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:36:01,798][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:36:02,514][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:36:03,230][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:36:03,946][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:36:04,662][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:36:05,377][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:36:06,094][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:36:06,810][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:36:07,527][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:36:08,243][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:36:08,962][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:36:09,677][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:36:10,395][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:36:11,110][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:36:11,828][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:36:12,544][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:36:13,261][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:36:13,978][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:36:14,694][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:36:15,670][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:36:16,386][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:36:17,104][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:36:17,821][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:36:18,538][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:36:19,255][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:36:19,972][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:36:20,690][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:36:21,406][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:36:22,124][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:36:22,841][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:36:23,558][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:36:24,276][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:36:24,992][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:36:25,709][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:36:26,430][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:36:27,148][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:36:27,874][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:36:28,865][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:36:28,867][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:36:28,868][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:36:30,833][__main__][INFO] - Iteration 640 took 56s (9.02% Gen, 87.47% Train). Generation: 5s, Training: 49s. Estimated remaining time: 5h 23m 44s. Estimated total time: 15h 35m 40s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 34s, 500 more iterations: 7h 47m 50s. [2026-03-26 00:36:30,837][__main__][INFO] - Starting iteration 640. [2026-03-26 00:36:30,843][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-26 00:36:30,844][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:36:35,895][__main__][INFO] - Number of regex retries in iteration 640: 0 [2026-03-26 00:36:35,896][__main__][INFO] - agents played in iteration 640 are Bob, Alice [2026-03-26 00:36:36,399][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:36:36,464][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:36:36,466][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:36:36,466][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:36:37,156][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:36:37,802][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:36:38,521][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:36:39,237][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:36:39,954][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:36:40,668][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:36:41,384][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:36:42,099][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:36:42,816][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:36:43,532][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:36:44,246][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:36:44,962][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:36:45,678][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:36:46,394][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:36:47,111][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:36:47,827][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:36:48,543][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:36:49,259][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:36:49,975][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:36:50,691][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:36:51,409][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:36:52,125][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:36:52,840][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:36:53,557][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:36:54,271][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:36:54,988][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:36:55,704][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:36:56,420][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:36:57,136][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:36:57,853][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:36:58,570][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:36:59,285][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:37:00,003][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:37:00,718][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:37:01,435][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:37:02,151][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:37:02,867][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:37:03,583][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:37:04,300][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:37:05,017][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:37:05,733][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:37:06,451][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:37:07,166][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:37:07,884][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:37:08,601][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:37:09,319][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:37:10,036][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:37:10,754][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:37:11,707][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:37:12,426][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:37:13,142][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:37:13,859][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:37:14,578][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:37:15,293][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:37:16,012][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:37:16,729][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:37:17,447][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:37:18,163][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:37:18,881][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:37:19,598][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:37:20,316][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:37:21,033][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:37:21,750][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:37:22,466][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:37:23,183][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:37:23,906][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:37:24,898][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:37:24,901][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:37:24,903][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:37:26,294][__main__][INFO] - Iteration 641 took 55s (9.11% Gen, 88.38% Train). Generation: 5s, Training: 49s. Estimated remaining time: 5h 11m 21s. Estimated total time: 15h 24m 13s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 25s, 500 more iterations: 7h 42m 6s. [2026-03-26 00:37:26,297][__main__][INFO] - Starting iteration 641. [2026-03-26 00:37:26,301][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-26 00:37:26,302][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:37:31,240][__main__][INFO] - Number of regex retries in iteration 641: 0 [2026-03-26 00:37:31,241][__main__][INFO] - agents played in iteration 641 are Bob, Alice [2026-03-26 00:37:31,742][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:37:31,806][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:37:31,807][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:37:31,808][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:37:32,537][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:37:33,182][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:37:33,898][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:37:34,614][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:37:35,329][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:37:36,046][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:37:36,761][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:37:37,475][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:37:38,192][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:37:38,908][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:37:39,624][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:37:40,340][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:37:41,057][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:37:41,773][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:37:42,489][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:37:43,206][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:37:43,922][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:37:44,638][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:37:45,354][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:37:46,071][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:37:46,789][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:37:47,506][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:37:48,222][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:37:48,938][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:37:49,655][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:37:50,370][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:37:51,087][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:37:51,803][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:37:52,521][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:37:53,238][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:37:53,954][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:37:54,669][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:37:55,388][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:37:56,103][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:37:56,820][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:37:57,534][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:37:58,251][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:37:58,968][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:37:59,685][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:38:00,400][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:38:01,118][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:38:01,835][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:38:02,552][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:38:03,268][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:38:03,987][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:38:04,701][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:38:05,420][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:38:06,135][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:38:07,081][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:38:07,797][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:38:08,514][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:38:09,232][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:38:09,951][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:38:10,669][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:38:11,385][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:38:12,103][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:38:12,819][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:38:13,536][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:38:14,253][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:38:14,970][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:38:15,688][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:38:16,404][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:38:17,123][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:38:17,840][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:38:18,558][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:38:19,321][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:38:20,395][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:38:20,399][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:38:20,401][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:38:21,887][__main__][INFO] - Iteration 642 took 55s (8.88% Gen, 88.44% Train). Generation: 4s, Training: 49s. Estimated remaining time: 5h 12m 40s. Estimated total time: 15h 26m 28s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 38s, 500 more iterations: 7h 43m 14s. [2026-03-26 00:38:21,890][__main__][INFO] - Starting iteration 642. [2026-03-26 00:38:21,894][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-26 00:38:21,894][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:38:26,929][__main__][INFO] - Number of regex retries in iteration 642: 0 [2026-03-26 00:38:26,931][__main__][INFO] - agents played in iteration 642 are Bob, Alice [2026-03-26 00:38:27,669][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:38:27,737][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:38:27,738][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:38:27,739][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:38:28,490][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:38:29,138][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:38:29,854][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:38:30,569][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:38:31,283][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:38:32,000][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:38:32,714][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:38:33,431][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:38:34,147][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:38:34,863][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:38:35,578][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:38:36,294][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:38:37,010][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:38:37,727][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:38:38,442][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:38:39,159][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:38:39,874][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:38:40,591][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:38:41,307][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:38:42,024][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:38:42,740][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:38:43,456][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:38:44,172][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:38:44,889][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:38:45,606][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:38:46,322][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:38:47,037][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:38:47,753][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:38:48,470][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:38:49,186][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:38:49,901][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:38:50,617][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:38:51,334][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:38:52,049][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:38:52,765][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:38:53,480][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:38:54,197][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:38:54,914][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:38:55,631][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:38:56,346][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:38:57,063][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:38:57,780][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:38:58,497][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:38:59,213][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:38:59,929][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:39:00,646][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:39:01,363][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:39:02,081][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:39:03,081][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:39:03,800][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:39:04,517][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:39:05,234][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:39:05,951][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:39:06,668][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:39:07,385][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:39:08,102][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:39:08,820][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:39:09,537][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:39:10,254][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:39:10,969][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:39:11,687][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:39:12,404][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:39:13,121][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:39:13,839][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:39:14,556][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:39:15,279][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:39:16,260][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:39:16,263][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:39:16,264][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:39:17,591][__main__][INFO] - Iteration 643 took 55s (9.04% Gen, 88.57% Train). Generation: 5s, Training: 49s. Estimated remaining time: 5h 13m 35s. Estimated total time: 15h 28m 19s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 49s, 500 more iterations: 7h 44m 9s. [2026-03-26 00:39:17,594][__main__][INFO] - Starting iteration 643. [2026-03-26 00:39:17,598][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-26 00:39:17,599][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:39:23,487][__main__][INFO] - Number of regex retries in iteration 643: 0 [2026-03-26 00:39:23,488][__main__][INFO] - agents played in iteration 643 are Bob, Alice [2026-03-26 00:39:23,988][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:39:24,051][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:39:24,051][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:39:24,052][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:39:24,740][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:39:25,385][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:39:26,102][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:39:26,818][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:39:27,535][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:39:28,248][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:39:28,966][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:39:29,680][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:39:30,396][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:39:31,112][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:39:31,828][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:39:32,543][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:39:33,260][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:39:33,976][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:39:34,692][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:39:35,408][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:39:36,124][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:39:36,840][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:39:37,556][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:39:38,272][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:39:38,990][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:39:39,706][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:39:40,424][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:39:41,140][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:39:41,857][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:39:42,573][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:39:43,293][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:39:44,008][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:39:44,725][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:39:45,441][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:39:46,157][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:39:46,873][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:39:47,588][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:39:48,304][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:39:49,020][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:39:49,737][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:39:50,454][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:39:51,174][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:39:51,889][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:39:52,606][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:39:53,322][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:39:54,038][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:39:54,754][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:39:55,470][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:39:56,186][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:39:56,902][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:39:57,619][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:39:58,335][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:39:59,281][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:39:59,999][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:40:00,715][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:40:01,432][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:40:02,148][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:40:02,864][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:40:03,580][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:40:04,298][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:40:05,015][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:40:05,731][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:40:06,447][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:40:07,164][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:40:07,881][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:40:08,599][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:40:09,317][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:40:10,034][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:40:10,753][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:40:11,476][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:40:12,425][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:40:12,427][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:40:12,428][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:40:13,793][__main__][INFO] - Iteration 644 took 56s (10.48% Gen, 87.09% Train). Generation: 5s, Training: 48s. Estimated remaining time: 5h 20m 57s. Estimated total time: 15h 36m 37s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 39s, 500 more iterations: 7h 48m 18s. [2026-03-26 00:40:13,796][__main__][INFO] - Starting iteration 644. [2026-03-26 00:40:13,800][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-26 00:40:13,800][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:40:18,764][__main__][INFO] - Number of regex retries in iteration 644: 0 [2026-03-26 00:40:18,765][__main__][INFO] - agents played in iteration 644 are Bob, Alice [2026-03-26 00:40:19,300][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:40:19,366][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:40:19,367][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:40:19,367][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:40:20,066][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:40:20,722][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:40:21,438][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:40:22,154][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:40:22,869][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:40:23,584][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:40:24,300][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:40:25,016][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:40:25,732][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:40:26,447][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:40:27,164][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:40:27,880][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:40:28,597][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:40:29,314][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:40:30,030][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:40:30,746][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:40:31,462][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:40:32,179][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:40:32,895][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:40:33,613][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:40:34,330][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:40:35,047][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:40:35,764][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:40:36,480][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:40:37,197][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:40:37,911][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:40:38,629][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:40:39,345][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:40:40,060][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:40:40,776][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:40:41,492][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:40:42,209][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:40:42,926][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:40:43,642][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:40:44,359][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:40:45,075][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:40:45,793][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:40:46,508][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:40:47,226][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:40:47,944][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:40:48,660][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:40:49,378][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:40:50,095][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:40:50,811][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:40:51,528][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:40:52,245][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:40:52,961][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:40:53,678][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:40:54,629][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:40:55,346][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:40:56,062][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:40:56,778][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:40:57,497][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:40:58,213][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:40:58,931][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:40:59,647][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:41:00,367][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:41:01,084][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:41:01,801][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:41:02,517][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:41:03,234][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:41:03,953][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:41:04,668][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:41:05,386][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:41:06,104][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:41:06,826][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:41:07,999][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:41:08,003][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:41:08,004][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:41:09,279][__main__][INFO] - Iteration 645 took 55s (8.95% Gen, 88.75% Train). Generation: 4s, Training: 49s. Estimated remaining time: 5h 8m 6s. Estimated total time: 15h 24m 41s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 28s, 500 more iterations: 7h 42m 20s. [2026-03-26 00:41:09,282][__main__][INFO] - Starting iteration 645. [2026-03-26 00:41:09,286][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-26 00:41:09,287][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:41:14,378][__main__][INFO] - Number of regex retries in iteration 645: 0 [2026-03-26 00:41:14,382][__main__][INFO] - agents played in iteration 645 are Bob, Alice [2026-03-26 00:41:14,973][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:41:15,038][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:41:15,039][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:41:15,040][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:41:15,742][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:41:16,417][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:41:17,134][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:41:17,848][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:41:18,565][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:41:19,279][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:41:19,996][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:41:20,711][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:41:21,429][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:41:22,145][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:41:22,864][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:41:23,579][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:41:24,297][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:41:25,013][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:41:25,731][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:41:26,446][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:41:27,165][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:41:27,880][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:41:28,598][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:41:29,314][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:41:30,032][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:41:30,747][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:41:31,465][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:41:32,182][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:41:32,899][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:41:33,616][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:41:34,333][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:41:35,050][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:41:35,766][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:41:36,482][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:41:37,198][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:41:37,916][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:41:38,633][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:41:39,351][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:41:40,065][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:41:40,783][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:41:41,500][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:41:42,217][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:41:42,935][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:41:43,651][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:41:44,367][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:41:45,083][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:41:45,800][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:41:46,516][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:41:47,232][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:41:47,949][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:41:48,666][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:41:49,383][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:41:50,416][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:41:51,135][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:41:51,852][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:41:52,567][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:41:53,284][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:41:54,000][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:41:54,716][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:41:55,433][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:41:56,151][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:41:56,868][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:41:57,586][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:41:58,302][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:41:59,021][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:41:59,736][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:42:00,454][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:42:01,171][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:42:01,888][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:42:02,620][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:42:03,646][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:42:03,649][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:42:03,650][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:42:06,029][__main__][INFO] - Iteration 646 took 56s (8.98% Gen, 86.82% Train). Generation: 5s, Training: 49s. Estimated remaining time: 5h 28m 12s. Estimated total time: 15h 45m 44s. Time estimates for 10 more iterations: 9m 27s, 100 more iterations: 1h 34m 34s, 500 more iterations: 7h 52m 52s. [2026-03-26 00:42:06,033][__main__][INFO] - Starting iteration 646. [2026-03-26 00:42:06,042][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-26 00:42:06,044][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:42:11,066][__main__][INFO] - Number of regex retries in iteration 646: 0 [2026-03-26 00:42:11,067][__main__][INFO] - agents played in iteration 646 are Bob, Alice [2026-03-26 00:42:11,607][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:42:11,672][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:42:11,673][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:42:11,674][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:42:12,354][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:42:12,999][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:42:13,717][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:42:14,433][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:42:15,148][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:42:15,864][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:42:16,579][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:42:17,295][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:42:18,010][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:42:18,728][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:42:19,443][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:42:20,160][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:42:20,876][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:42:21,592][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:42:22,308][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:42:23,026][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:42:23,743][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:42:24,460][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:42:25,176][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:42:25,894][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:42:26,609][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:42:27,326][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:42:28,042][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:42:28,760][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:42:29,475][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:42:30,192][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:42:30,910][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:42:31,629][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:42:32,351][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:42:33,070][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:42:33,791][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:42:34,511][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:42:35,229][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:42:35,949][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:42:36,668][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:42:37,384][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:42:38,104][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:42:38,823][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:42:39,542][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:42:40,261][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:42:40,978][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:42:41,694][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:42:42,414][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:42:43,132][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:42:43,849][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:42:44,567][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:42:45,284][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:42:46,002][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:42:46,950][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:42:47,668][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:42:48,384][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:42:49,102][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:42:49,818][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:42:50,534][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:42:51,251][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:42:51,967][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:42:52,683][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:42:53,400][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:42:54,118][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:42:54,833][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:42:55,551][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:42:56,268][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:42:56,985][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:42:57,701][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:42:58,418][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:42:59,149][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:43:00,103][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:43:00,105][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:43:00,106][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:43:01,462][__main__][INFO] - Iteration 647 took 55s (9.06% Gen, 88.48% Train). Generation: 5s, Training: 49s. Estimated remaining time: 5h 5m 15s. Estimated total time: 15h 23m 43s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 22s, 500 more iterations: 7h 41m 51s. [2026-03-26 00:43:01,464][__main__][INFO] - Starting iteration 647. [2026-03-26 00:43:01,469][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-26 00:43:01,469][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:43:06,560][__main__][INFO] - Number of regex retries in iteration 647: 0 [2026-03-26 00:43:06,561][__main__][INFO] - agents played in iteration 647 are Bob, Alice [2026-03-26 00:43:07,383][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:43:07,448][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:43:07,449][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:43:07,450][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:43:08,146][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:43:08,792][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:43:09,510][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:43:10,224][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:43:10,940][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:43:11,655][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:43:12,369][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:43:13,085][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:43:13,800][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:43:14,515][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:43:15,231][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:43:15,947][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:43:16,662][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:43:17,379][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:43:18,094][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:43:18,811][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:43:19,527][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:43:20,246][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:43:20,960][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:43:21,679][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:43:22,394][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:43:23,114][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:43:23,829][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:43:24,546][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:43:25,263][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:43:25,978][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:43:26,696][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:43:27,411][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:43:28,128][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:43:28,846][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:43:29,561][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:43:30,276][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:43:30,993][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:43:31,709][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:43:32,425][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:43:33,143][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:43:33,859][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:43:34,576][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:43:35,291][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:43:36,008][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:43:36,724][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:43:37,440][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:43:38,156][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:43:38,875][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:43:39,591][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:43:40,307][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:43:41,024][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:43:41,741][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:43:42,682][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:43:43,400][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:43:44,116][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:43:44,834][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:43:45,550][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:43:46,268][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:43:46,985][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:43:47,703][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:43:48,419][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:43:49,137][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:43:49,853][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:43:50,573][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:43:51,295][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:43:52,015][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:43:52,738][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:43:53,460][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:43:54,180][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:43:54,966][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:43:55,945][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:43:55,948][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:43:55,950][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:43:57,277][__main__][INFO] - Iteration 648 took 55s (9.12% Gen, 88.49% Train). Generation: 5s, Training: 49s. Estimated remaining time: 5h 10m 47s. Estimated total time: 15h 30m 10s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 1s, 500 more iterations: 7h 45m 5s. [2026-03-26 00:43:57,279][__main__][INFO] - Starting iteration 648. [2026-03-26 00:43:57,283][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-26 00:43:57,284][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:44:02,243][__main__][INFO] - Number of regex retries in iteration 648: 0 [2026-03-26 00:44:02,244][__main__][INFO] - agents played in iteration 648 are Bob, Alice [2026-03-26 00:44:02,751][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:44:02,818][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:44:02,819][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:44:02,819][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:44:03,510][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:44:04,156][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:44:04,873][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:44:05,588][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:44:06,303][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:44:07,018][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:44:07,734][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:44:08,448][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:44:09,166][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:44:09,882][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:44:10,598][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:44:11,313][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:44:12,030][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:44:12,745][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:44:13,462][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:44:14,180][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:44:14,896][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:44:15,613][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:44:16,328][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:44:17,047][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:44:17,763][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:44:18,482][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:44:19,197][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:44:19,916][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:44:20,631][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:44:21,349][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:44:22,065][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:44:22,784][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:44:23,500][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:44:24,217][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:44:24,932][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:44:25,648][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:44:26,364][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:44:27,081][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:44:27,796][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:44:28,514][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:44:29,229][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:44:29,946][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:44:30,662][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:44:31,379][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:44:32,094][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:44:32,812][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:44:33,528][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:44:34,244][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:44:34,961][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:44:35,677][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:44:36,394][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:44:37,111][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:44:38,152][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:44:38,870][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:44:39,588][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:44:40,302][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:44:41,022][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:44:41,738][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:44:42,456][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:44:43,173][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:44:43,890][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:44:44,606][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:44:45,323][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:44:46,040][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:44:46,757][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:44:47,476][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:44:48,192][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:44:48,910][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:44:49,627][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:44:50,358][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:44:51,396][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:44:51,400][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:44:51,403][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:44:52,851][__main__][INFO] - Iteration 649 took 55s (8.93% Gen, 88.46% Train). Generation: 4s, Training: 49s. Estimated remaining time: 5h 5m 50s. Estimated total time: 15h 26m 8s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 36s, 500 more iterations: 7h 43m 4s. [2026-03-26 00:44:52,853][__main__][INFO] - Starting iteration 649. [2026-03-26 00:44:52,858][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-26 00:44:52,858][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:44:58,055][__main__][INFO] - Number of regex retries in iteration 649: 0 [2026-03-26 00:44:58,056][__main__][INFO] - agents played in iteration 649 are Bob, Alice [2026-03-26 00:44:58,555][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:44:58,623][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:44:58,624][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:44:58,625][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:44:59,311][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:44:59,959][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:45:00,677][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:45:01,393][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:45:02,108][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:45:02,823][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:45:03,539][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:45:04,253][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:45:04,972][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:45:05,688][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:45:06,405][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:45:07,119][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:45:07,837][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:45:08,553][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:45:09,270][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:45:09,985][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:45:10,702][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:45:11,418][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:45:12,134][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:45:12,850][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:45:13,566][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:45:14,283][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:45:15,000][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:45:15,718][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:45:16,436][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:45:17,151][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:45:17,870][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:45:18,587][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:45:19,304][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:45:20,022][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:45:20,737][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:45:21,453][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:45:22,169][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:45:22,884][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:45:23,603][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:45:24,319][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:45:25,038][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:45:25,753][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:45:26,471][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:45:27,186][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:45:27,904][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:45:28,620][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:45:29,338][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:45:30,054][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:45:30,771][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:45:31,488][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:45:32,205][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:45:32,923][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:45:33,870][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:45:34,588][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:45:35,306][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:45:36,022][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:45:36,741][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:45:37,457][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:45:38,175][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:45:38,891][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:45:39,610][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:45:40,327][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:45:41,045][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:45:41,762][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:45:42,480][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:45:43,197][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:45:43,915][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:45:44,633][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:45:45,350][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:45:46,081][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:45:47,067][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:45:47,069][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:45:47,071][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:45:48,388][__main__][INFO] - Iteration 650 took 55s (9.36% Gen, 88.26% Train). Generation: 5s, Training: 49s. Estimated remaining time: 5h 4m 17s. Estimated total time: 15h 25m 32s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 33s, 500 more iterations: 7h 42m 46s. [2026-03-26 00:45:48,391][__main__][INFO] - Starting iteration 650. [2026-03-26 00:45:48,394][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2026-03-26 00:45:48,395][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:45:53,436][__main__][INFO] - Number of regex retries in iteration 650: 0 [2026-03-26 00:45:53,437][__main__][INFO] - agents played in iteration 650 are Bob, Alice [2026-03-26 00:45:53,942][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:45:54,008][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:45:54,009][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:45:54,010][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:45:54,703][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:45:55,348][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:45:56,066][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:45:56,782][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:45:57,497][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:45:58,214][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:45:58,930][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:45:59,645][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:46:00,361][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:46:01,078][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:46:01,793][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:46:02,509][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:46:03,226][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:46:03,942][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:46:04,659][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:46:05,375][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:46:06,092][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:46:06,808][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:46:07,524][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:46:08,241][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:46:08,959][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:46:09,679][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:46:10,394][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:46:11,112][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:46:11,828][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:46:12,548][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:46:13,264][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:46:13,982][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:46:14,698][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:46:15,415][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:46:16,130][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:46:16,848][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:46:17,562][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:46:18,278][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:46:18,994][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:46:19,710][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:46:20,427][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:46:21,142][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:46:21,858][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:46:22,574][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:46:23,291][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:46:24,007][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:46:24,725][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:46:25,441][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:46:26,157][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:46:26,874][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:46:27,590][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:46:28,309][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:46:29,255][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:46:29,973][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:46:30,690][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:46:31,407][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:46:32,124][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:46:32,840][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:46:33,559][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:46:34,275][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:46:34,991][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:46:35,708][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:46:36,425][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:46:37,141][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:46:37,857][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:46:38,575][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:46:39,293][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:46:40,011][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:46:40,727][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:46:41,451][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:46:42,725][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:46:42,730][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:46:42,732][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:46:45,545][__main__][INFO] - Iteration 651 took 57s (8.82% Gen, 86.25% Train). Generation: 5s, Training: 49s. Estimated remaining time: 5h 30m 21s. Estimated total time: 15h 52m 32s. Time estimates for 10 more iterations: 9m 31s, 100 more iterations: 1h 35m 15s, 500 more iterations: 7h 56m 16s. [2026-03-26 00:46:45,550][__main__][INFO] - Starting iteration 651. [2026-03-26 00:46:45,554][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:46:45,555][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:46:51,600][__main__][INFO] - Number of regex retries in iteration 651: 0 [2026-03-26 00:46:51,602][__main__][INFO] - agents played in iteration 651 are Bob, Alice [2026-03-26 00:46:52,100][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:46:52,164][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:46:52,165][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:46:52,166][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:46:52,846][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:46:53,492][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:46:54,209][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:46:54,923][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:46:55,637][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:46:56,352][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:46:57,066][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:46:57,784][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:46:58,499][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:46:59,215][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:46:59,927][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:47:00,641][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:47:01,358][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:47:02,072][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:47:02,790][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:47:03,506][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:47:04,221][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:47:04,938][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:47:05,653][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:47:06,370][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:47:07,084][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:47:07,801][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:47:08,517][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:47:09,234][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:47:09,949][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:47:10,665][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:47:11,382][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:47:12,097][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:47:12,815][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:47:13,532][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:47:14,249][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:47:14,965][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:47:15,684][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:47:16,400][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:47:17,118][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:47:17,834][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:47:18,551][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:47:19,268][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:47:19,983][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:47:20,702][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:47:21,417][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:47:22,135][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:47:22,850][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:47:23,568][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:47:24,283][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:47:25,000][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:47:25,716][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:47:26,433][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:47:27,467][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:47:28,183][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:47:28,900][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:47:29,616][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:47:30,333][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:47:31,050][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:47:31,767][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:47:32,484][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:47:33,200][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:47:33,917][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:47:34,633][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:47:35,351][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:47:36,067][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:47:36,786][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:47:37,502][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:47:38,219][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:47:38,936][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:47:39,667][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:47:40,735][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:47:40,739][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:47:40,741][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:47:42,029][__main__][INFO] - Iteration 652 took 56s (10.71% Gen, 87.01% Train). Generation: 6s, Training: 49s. Estimated remaining time: 5h 18m 8s. Estimated total time: 15h 41m 16s. Time estimates for 10 more iterations: 9m 24s, 100 more iterations: 1h 34m 7s, 500 more iterations: 7h 50m 38s. [2026-03-26 00:47:42,033][__main__][INFO] - Starting iteration 652. [2026-03-26 00:47:42,040][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:47:42,042][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:47:47,155][__main__][INFO] - Number of regex retries in iteration 652: 0 [2026-03-26 00:47:47,157][__main__][INFO] - agents played in iteration 652 are Bob, Alice [2026-03-26 00:47:47,736][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:47:47,799][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:47:47,800][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:47:47,801][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:47:48,497][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:47:49,141][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:47:49,862][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:47:50,577][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:47:51,294][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:47:52,007][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:47:52,724][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:47:53,439][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:47:54,156][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:47:54,871][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:47:55,588][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:47:56,304][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:47:57,021][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:47:57,737][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:47:58,453][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:47:59,168][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:47:59,886][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:48:00,602][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:48:01,319][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:48:02,035][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:48:02,751][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:48:03,468][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:48:04,184][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:48:04,901][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:48:05,617][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:48:06,335][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:48:07,051][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:48:07,767][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:48:08,484][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:48:09,199][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:48:09,916][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:48:10,631][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:48:11,347][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:48:12,063][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:48:12,778][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:48:13,494][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:48:14,209][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:48:14,926][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:48:15,642][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:48:16,359][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:48:17,075][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:48:17,793][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:48:18,508][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:48:19,225][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:48:19,940][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:48:20,657][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:48:21,374][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:48:22,092][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:48:23,036][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:48:23,754][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:48:24,471][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:48:25,188][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:48:25,907][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:48:26,624][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:48:27,342][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:48:28,058][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:48:28,779][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:48:29,496][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:48:30,213][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:48:30,931][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:48:31,647][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:48:32,364][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:48:33,082][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:48:33,800][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:48:34,517][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:48:35,249][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:48:36,196][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:48:36,199][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:48:36,200][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:48:37,559][__main__][INFO] - Iteration 653 took 55s (9.21% Gen, 88.33% Train). Generation: 5s, Training: 49s. Estimated remaining time: 5h 1m 18s. Estimated total time: 15h 25m 22s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 32s, 500 more iterations: 7h 42m 41s. [2026-03-26 00:48:37,561][__main__][INFO] - Starting iteration 653. [2026-03-26 00:48:37,567][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:48:37,568][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:48:42,535][__main__][INFO] - Number of regex retries in iteration 653: 0 [2026-03-26 00:48:42,536][__main__][INFO] - agents played in iteration 653 are Bob, Alice [2026-03-26 00:48:43,084][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:48:43,148][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:48:43,149][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:48:43,150][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:48:43,824][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:48:44,470][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:48:45,187][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:48:45,904][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:48:46,620][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:48:47,338][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:48:48,054][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:48:48,770][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:48:49,485][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:48:50,203][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:48:50,919][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:48:51,635][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:48:52,352][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:48:53,070][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:48:53,788][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:48:54,505][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:48:55,220][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:48:55,937][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:48:56,654][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:48:57,371][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:48:58,087][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:48:58,804][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:48:59,521][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:49:00,238][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:49:00,954][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:49:01,671][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:49:02,388][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:49:03,105][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:49:03,823][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:49:04,539][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:49:05,258][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:49:05,973][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:49:06,690][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:49:07,406][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:49:08,123][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:49:08,840][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:49:09,558][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:49:10,273][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:49:10,992][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:49:11,706][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:49:12,426][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:49:13,142][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:49:13,860][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:49:14,575][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:49:15,294][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:49:16,010][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:49:16,729][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:49:17,445][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:49:18,390][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:49:19,108][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:49:19,824][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:49:20,542][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:49:21,259][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:49:21,977][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:49:22,694][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:49:23,412][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:49:24,128][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:49:24,848][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:49:25,564][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:49:26,282][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:49:26,999][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:49:27,718][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:49:28,437][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:49:29,155][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:49:29,873][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:49:30,619][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:49:31,627][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:49:31,630][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:49:31,632][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:49:32,990][__main__][INFO] - Iteration 654 took 55s (8.96% Gen, 88.58% Train). Generation: 4s, Training: 49s. Estimated remaining time: 4h 58m 46s. Estimated total time: 15h 23m 45s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 22s, 500 more iterations: 7h 41m 52s. [2026-03-26 00:49:32,993][__main__][INFO] - Starting iteration 654. [2026-03-26 00:49:32,997][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:49:32,998][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:49:39,610][__main__][INFO] - Number of regex retries in iteration 654: 0 [2026-03-26 00:49:39,611][__main__][INFO] - agents played in iteration 654 are Bob, Alice [2026-03-26 00:49:40,114][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:49:40,178][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:49:40,180][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:49:40,181][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:49:40,865][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:49:41,509][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:49:42,226][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:49:42,941][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:49:43,655][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:49:44,369][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:49:45,084][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:49:45,797][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:49:46,515][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:49:47,229][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:49:47,945][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:49:48,660][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:49:49,377][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:49:50,091][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:49:50,808][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:49:51,522][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:49:52,239][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:49:52,954][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:49:53,670][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:49:54,387][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:49:55,102][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:49:55,820][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:49:56,535][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:49:57,252][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:49:57,969][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:49:58,686][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:49:59,402][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:50:00,118][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:50:00,834][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:50:01,551][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:50:02,266][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:50:02,984][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:50:03,701][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:50:04,418][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:50:05,136][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:50:05,851][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:50:06,571][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:50:07,286][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:50:08,005][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:50:08,723][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:50:09,441][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:50:10,159][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:50:10,875][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:50:11,594][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:50:12,311][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:50:13,032][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:50:13,748][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:50:14,464][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:50:15,464][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:50:16,179][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:50:16,896][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:50:17,612][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:50:18,329][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:50:19,044][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:50:19,762][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:50:20,479][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:50:21,196][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:50:21,912][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:50:22,630][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:50:23,346][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:50:24,064][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:50:24,780][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:50:25,496][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:50:26,214][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:50:26,931][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:50:27,680][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:50:28,702][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:50:28,705][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:50:28,706][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:50:29,954][__main__][INFO] - Iteration 655 took 56s (11.61% Gen, 86.19% Train). Generation: 6s, Training: 49s. Estimated remaining time: 5h 23m 22s. Estimated total time: 15h 49m 18s. Time estimates for 10 more iterations: 9m 29s, 100 more iterations: 1h 34m 55s, 500 more iterations: 7h 54m 39s. [2026-03-26 00:50:29,957][__main__][INFO] - Starting iteration 655. [2026-03-26 00:50:29,960][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:50:29,961][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:50:34,901][__main__][INFO] - Number of regex retries in iteration 655: 0 [2026-03-26 00:50:34,902][__main__][INFO] - agents played in iteration 655 are Bob, Alice [2026-03-26 00:50:35,402][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:50:35,466][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:50:35,466][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:50:35,467][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:50:36,151][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:50:36,798][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:50:37,515][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:50:38,230][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:50:38,950][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:50:39,667][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:50:40,384][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:50:41,099][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:50:41,815][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:50:42,530][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:50:43,247][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:50:43,963][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:50:44,679][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:50:45,395][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:50:46,113][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:50:46,829][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:50:47,547][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:50:48,261][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:50:48,979][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:50:49,694][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:50:50,412][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:50:51,128][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:50:51,844][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:50:52,561][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:50:53,278][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:50:53,994][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:50:54,712][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:50:55,430][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:50:56,146][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:50:56,864][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:50:57,579][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:50:58,298][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:50:59,016][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:50:59,732][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:51:00,448][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:51:01,164][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:51:01,881][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:51:02,596][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:51:03,314][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:51:04,031][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:51:04,746][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:51:05,463][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:51:06,178][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:51:06,895][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:51:07,611][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:51:08,329][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:51:09,050][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:51:09,766][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:51:10,714][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:51:11,433][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:51:12,149][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:51:12,867][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:51:13,584][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:51:14,302][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:51:15,019][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:51:15,736][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:51:16,454][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:51:17,170][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:51:17,888][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:51:18,605][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:51:19,323][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:51:20,039][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:51:20,757][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:51:21,474][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:51:22,190][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:51:22,921][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:51:23,964][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:51:23,968][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:51:23,969][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:51:25,331][__main__][INFO] - Iteration 656 took 55s (8.92% Gen, 88.61% Train). Generation: 4s, Training: 49s. Estimated remaining time: 4h 56m 1s. Estimated total time: 15h 22m 52s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 17s, 500 more iterations: 7h 41m 26s. [2026-03-26 00:51:25,334][__main__][INFO] - Starting iteration 656. [2026-03-26 00:51:25,338][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:51:25,339][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:51:30,268][__main__][INFO] - Number of regex retries in iteration 656: 0 [2026-03-26 00:51:30,269][__main__][INFO] - agents played in iteration 656 are Bob, Alice [2026-03-26 00:51:30,763][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:51:30,828][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:51:30,829][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:51:30,830][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:51:31,518][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:51:32,163][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:51:32,882][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:51:33,598][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:51:34,313][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:51:35,029][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:51:35,744][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:51:36,460][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:51:37,176][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:51:37,893][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:51:38,608][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:51:39,327][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:51:40,044][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:51:40,762][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:51:41,476][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:51:42,193][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:51:42,909][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:51:43,626][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:51:44,345][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:51:45,061][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:51:45,780][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:51:46,497][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:51:47,215][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:51:47,932][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:51:48,650][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:51:49,365][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:51:50,083][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:51:50,801][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:51:51,518][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:51:52,235][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:51:52,953][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:51:53,670][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:51:54,389][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:51:55,105][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:51:55,824][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:51:56,541][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:51:57,258][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:51:57,975][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:51:58,692][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:51:59,409][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:52:00,125][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:52:00,845][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:52:01,562][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:52:02,279][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:52:02,995][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:52:03,714][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:52:04,430][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:52:05,149][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:52:06,095][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:52:06,812][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:52:07,529][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:52:08,246][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:52:08,964][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:52:09,683][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:52:10,399][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:52:11,118][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:52:11,834][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:52:12,550][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:52:13,267][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:52:13,984][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:52:14,701][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:52:15,418][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:52:16,135][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:52:16,851][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:52:17,570][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:52:18,306][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:52:19,623][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:52:19,628][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:52:19,630][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:52:21,154][__main__][INFO] - Iteration 657 took 55s (8.83% Gen, 88.43% Train). Generation: 4s, Training: 49s. Estimated remaining time: 5h 2m 30s. Estimated total time: 15h 30m 17s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 1s, 500 more iterations: 7h 45m 8s. [2026-03-26 00:52:21,158][__main__][INFO] - Starting iteration 657. [2026-03-26 00:52:21,165][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:52:21,166][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:52:26,205][__main__][INFO] - Number of regex retries in iteration 657: 0 [2026-03-26 00:52:26,206][__main__][INFO] - agents played in iteration 657 are Bob, Alice [2026-03-26 00:52:26,705][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:52:26,770][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:52:26,771][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:52:26,772][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:52:27,451][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:52:28,097][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:52:28,813][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:52:29,530][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:52:30,244][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:52:30,962][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:52:31,676][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:52:32,394][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:52:33,109][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:52:33,825][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:52:34,541][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:52:35,255][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:52:35,974][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:52:36,688][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:52:37,405][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:52:38,122][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:52:38,840][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:52:39,556][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:52:40,275][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:52:40,991][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:52:41,708][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:52:42,425][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:52:43,141][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:52:43,858][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:52:44,576][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:52:45,292][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:52:46,011][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:52:46,727][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:52:47,446][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:52:48,162][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:52:48,879][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:52:49,595][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:52:50,312][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:52:51,029][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:52:51,745][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:52:52,463][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:52:53,179][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:52:53,896][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:52:54,613][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:52:55,329][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:52:56,046][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:52:56,762][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:52:57,480][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:52:58,196][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:52:58,915][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:52:59,631][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:53:00,349][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:53:01,065][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:53:02,078][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:53:02,796][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:53:03,514][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:53:04,231][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:53:04,948][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:53:05,665][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:53:06,382][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:53:07,099][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:53:07,815][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:53:08,535][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:53:09,253][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:53:09,971][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:53:10,688][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:53:11,405][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:53:12,123][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:53:12,841][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:53:13,559][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:53:14,294][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:53:15,280][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:53:15,283][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:53:15,284][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:53:16,638][__main__][INFO] - Iteration 658 took 55s (9.08% Gen, 88.47% Train). Generation: 5s, Training: 49s. Estimated remaining time: 4h 55m 54s. Estimated total time: 15h 24m 36s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 27s, 500 more iterations: 7h 42m 18s. [2026-03-26 00:53:16,641][__main__][INFO] - Starting iteration 658. [2026-03-26 00:53:16,644][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:53:16,645][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:53:21,617][__main__][INFO] - Number of regex retries in iteration 658: 0 [2026-03-26 00:53:21,618][__main__][INFO] - agents played in iteration 658 are Bob, Alice [2026-03-26 00:53:22,122][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:53:22,186][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:53:22,187][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:53:22,188][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:53:22,875][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:53:23,522][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:53:24,240][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:53:24,955][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:53:25,670][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:53:26,386][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:53:27,102][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:53:27,818][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:53:28,535][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:53:29,250][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:53:29,967][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:53:30,682][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:53:31,401][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:53:32,117][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:53:32,834][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:53:33,551][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:53:34,267][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:53:34,984][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:53:35,701][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:53:36,417][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:53:37,136][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:53:37,854][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:53:38,571][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:53:39,289][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:53:40,006][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:53:40,724][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:53:41,440][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:53:42,158][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:53:42,873][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:53:43,591][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:53:44,307][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:53:45,024][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:53:45,740][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:53:46,458][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:53:47,175][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:53:47,892][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:53:48,608][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:53:49,326][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:53:50,041][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:53:50,760][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:53:51,475][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:53:52,193][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:53:52,909][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:53:53,627][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:53:54,343][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:53:55,060][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:53:55,775][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:53:56,493][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:53:57,444][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:53:58,163][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:53:58,880][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:53:59,596][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:54:00,313][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:54:01,032][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:54:01,748][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:54:02,467][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:54:03,184][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:54:03,903][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:54:04,619][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:54:05,337][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:54:06,055][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:54:06,773][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:54:07,489][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:54:08,208][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:54:08,925][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:54:09,654][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:54:10,633][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:54:10,635][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:54:10,637][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:54:12,035][__main__][INFO] - Iteration 659 took 55s (8.98% Gen, 88.49% Train). Generation: 4s, Training: 49s. Estimated remaining time: 4h 53m 34s. Estimated total time: 15h 23m 12s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 19s, 500 more iterations: 7h 41m 36s. [2026-03-26 00:54:12,038][__main__][INFO] - Starting iteration 659. [2026-03-26 00:54:12,043][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:54:12,043][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:54:17,810][__main__][INFO] - Number of regex retries in iteration 659: 0 [2026-03-26 00:54:17,811][__main__][INFO] - agents played in iteration 659 are Bob, Alice [2026-03-26 00:54:18,334][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:54:18,398][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:54:18,399][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:54:18,400][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:54:19,105][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:54:19,752][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:54:20,469][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:54:21,183][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:54:21,899][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:54:22,613][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:54:23,330][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:54:24,047][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:54:24,762][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:54:25,479][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:54:26,195][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:54:26,911][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:54:27,627][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:54:28,343][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:54:29,059][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:54:29,775][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:54:30,491][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:54:31,208][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:54:31,924][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:54:32,640][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:54:33,356][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:54:34,074][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:54:34,792][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:54:35,508][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:54:36,227][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:54:36,942][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:54:37,660][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:54:38,375][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:54:39,093][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:54:39,809][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:54:40,524][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:54:41,240][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:54:41,958][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:54:42,672][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:54:43,390][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:54:44,105][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:54:44,823][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:54:45,540][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:54:46,257][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:54:46,974][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:54:47,690][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:54:48,406][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:54:49,123][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:54:49,839][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:54:50,556][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:54:51,272][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:54:51,991][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:54:52,708][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:54:53,657][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:54:54,375][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:54:55,091][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:54:55,808][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:54:56,523][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:54:57,241][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:54:57,958][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:54:58,675][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:54:59,391][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:55:00,110][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:55:00,826][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:55:01,544][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:55:02,261][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:55:02,980][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:55:03,696][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:55:04,417][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:55:05,133][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:55:05,889][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:55:06,926][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:55:07,636][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:55:07,638][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:55:08,952][__main__][INFO] - Iteration 660 took 56s (10.13% Gen, 87.55% Train). Generation: 5s, Training: 49s. Estimated remaining time: 5h 17m 57s. Estimated total time: 15h 48m 31s. Time estimates for 10 more iterations: 9m 29s, 100 more iterations: 1h 34m 51s, 500 more iterations: 7h 54m 15s. [2026-03-26 00:55:08,955][__main__][INFO] - Starting iteration 660. [2026-03-26 00:55:08,959][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:55:08,960][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:55:14,343][__main__][INFO] - Number of regex retries in iteration 660: 0 [2026-03-26 00:55:14,345][__main__][INFO] - agents played in iteration 660 are Bob, Alice [2026-03-26 00:55:14,935][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:55:15,001][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:55:15,002][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:55:15,002][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:55:15,688][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:55:16,334][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:55:17,050][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:55:17,764][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:55:18,480][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:55:19,194][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:55:19,911][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:55:20,625][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:55:21,341][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:55:22,056][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:55:22,773][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:55:23,488][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:55:24,205][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:55:24,921][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:55:25,638][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:55:26,354][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:55:27,072][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:55:27,788][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:55:28,507][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:55:29,222][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:55:29,941][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:55:30,656][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:55:31,374][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:55:32,090][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:55:32,806][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:55:33,523][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:55:34,241][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:55:34,957][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:55:35,675][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:55:36,393][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:55:37,109][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:55:37,828][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:55:38,545][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:55:39,263][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:55:39,981][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:55:40,697][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:55:41,417][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:55:42,134][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:55:42,852][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:55:43,569][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:55:44,286][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:55:45,005][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:55:45,724][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:55:46,441][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:55:47,160][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:55:47,879][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:55:48,597][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:55:49,316][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:55:50,316][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:55:51,035][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:55:51,751][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:55:52,468][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:55:53,185][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:55:53,902][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:55:54,620][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:55:55,337][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:55:56,054][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:55:56,772][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:55:57,489][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:55:58,206][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:55:58,923][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:55:59,641][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:56:00,359][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:56:01,077][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:56:01,794][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:56:02,531][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:56:03,585][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:56:03,588][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:56:03,589][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:56:04,909][__main__][INFO] - Iteration 661 took 55s (9.63% Gen, 88.01% Train). Generation: 5s, Training: 49s. Estimated remaining time: 5h 1m 1s. Estimated total time: 15h 32m 31s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 15s, 500 more iterations: 7h 46m 15s. [2026-03-26 00:56:04,913][__main__][INFO] - Starting iteration 661. [2026-03-26 00:56:04,920][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:56:04,921][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:56:09,856][__main__][INFO] - Number of regex retries in iteration 661: 0 [2026-03-26 00:56:09,857][__main__][INFO] - agents played in iteration 661 are Bob, Alice [2026-03-26 00:56:10,399][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:56:10,466][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:56:10,467][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:56:10,467][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:56:11,161][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:56:11,807][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:56:12,524][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:56:13,240][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:56:13,957][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:56:14,672][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:56:15,389][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:56:16,105][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:56:16,822][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:56:17,538][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:56:18,253][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:56:18,970][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:56:19,688][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:56:20,404][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:56:21,121][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:56:21,839][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:56:22,554][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:56:23,273][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:56:23,990][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:56:24,707][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:56:25,423][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:56:26,141][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:56:26,859][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:56:27,578][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:56:28,297][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:56:29,014][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:56:29,735][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:56:30,453][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:56:31,172][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:56:31,891][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:56:32,608][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:56:33,327][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:56:34,045][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:56:34,764][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:56:35,482][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:56:36,200][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:56:36,922][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:56:37,641][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:56:38,360][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:56:39,079][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:56:39,798][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:56:40,515][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:56:41,236][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:56:41,954][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:56:42,672][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:56:43,391][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:56:44,110][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:56:44,829][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:56:45,786][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:56:46,508][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:56:47,226][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:56:47,943][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:56:48,663][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:56:49,382][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:56:50,100][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:56:50,820][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:56:51,538][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:56:52,258][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:56:52,977][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:56:53,696][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:56:54,416][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:56:55,134][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:56:55,853][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:56:56,573][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:56:57,292][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:56:58,036][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:56:59,095][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:56:59,098][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:56:59,099][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:57:00,471][__main__][INFO] - Iteration 662 took 55s (8.88% Gen, 88.64% Train). Generation: 4s, Training: 49s. Estimated remaining time: 4h 53m 28s. Estimated total time: 15h 25m 54s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 35s, 500 more iterations: 7h 42m 57s. [2026-03-26 00:57:00,475][__main__][INFO] - Starting iteration 662. [2026-03-26 00:57:00,478][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:57:00,479][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:57:05,461][__main__][INFO] - Number of regex retries in iteration 662: 0 [2026-03-26 00:57:05,463][__main__][INFO] - agents played in iteration 662 are Bob, Alice [2026-03-26 00:57:05,983][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:57:06,048][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:57:06,048][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:57:06,049][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:57:06,762][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:57:07,407][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:57:08,126][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:57:08,844][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:57:09,561][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:57:10,278][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:57:10,996][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:57:11,715][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:57:12,433][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:57:13,152][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:57:13,872][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:57:14,591][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:57:15,312][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:57:16,029][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:57:16,749][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:57:17,467][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:57:18,186][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:57:18,905][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:57:19,623][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:57:20,342][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:57:21,061][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:57:21,781][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:57:22,499][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:57:23,218][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:57:23,935][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:57:24,654][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:57:25,371][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:57:26,090][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:57:26,807][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:57:27,525][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:57:28,243][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:57:28,961][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:57:29,679][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:57:30,397][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:57:31,117][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:57:31,834][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:57:32,552][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:57:33,271][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:57:33,990][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:57:34,709][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:57:35,426][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:57:36,145][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:57:36,864][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:57:37,582][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:57:38,301][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:57:39,019][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:57:39,738][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:57:40,457][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:57:41,431][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:57:42,152][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:57:42,871][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:57:43,591][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:57:44,310][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:57:45,029][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:57:45,748][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:57:46,467][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:57:47,186][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:57:47,906][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:57:48,624][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:57:49,344][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:57:50,066][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:57:50,786][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:57:51,506][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:57:52,226][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:57:52,948][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:57:53,760][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-26 00:57:54,839][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:57:54,843][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:57:54,845][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:57:56,112][__main__][INFO] - Iteration 663 took 55s (8.96% Gen, 88.76% Train). Generation: 4s, Training: 49s. Estimated remaining time: 4h 53m 53s. Estimated total time: 15h 27m 15s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 43s, 500 more iterations: 7h 43m 37s. [2026-03-26 00:57:56,115][__main__][INFO] - Starting iteration 663. [2026-03-26 00:57:56,119][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:57:56,120][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:58:01,169][__main__][INFO] - Number of regex retries in iteration 663: 0 [2026-03-26 00:58:01,170][__main__][INFO] - agents played in iteration 663 are Bob, Alice [2026-03-26 00:58:01,673][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:58:01,738][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:58:01,739][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:58:01,740][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:58:02,463][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:58:03,111][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:58:03,831][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:58:04,549][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:58:05,268][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:58:05,988][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:58:06,707][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:58:07,425][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:58:08,144][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:58:08,861][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:58:09,581][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:58:10,299][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:58:11,018][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:58:11,739][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:58:12,457][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:58:13,175][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:58:13,895][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:58:14,614][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:58:15,333][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:58:17,296][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:58:18,016][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:58:18,734][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:58:19,452][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:58:20,170][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:58:20,888][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:58:21,608][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:58:22,327][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:58:23,046][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:58:23,764][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:58:24,483][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:58:25,204][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:58:25,922][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:58:26,642][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:58:27,361][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:58:28,079][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:58:28,801][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:58:29,520][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:58:30,240][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:58:30,961][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:58:31,680][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:58:32,402][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:58:33,121][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:58:33,840][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:58:34,561][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:58:35,280][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:58:36,000][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:58:36,717][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:58:43,280][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:58:46,315][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:58:47,033][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:58:47,748][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:58:48,465][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:58:49,181][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:58:49,898][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:58:50,615][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:58:51,333][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:58:52,053][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:58:52,770][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:58:53,487][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:58:54,205][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:58:54,922][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:58:55,639][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:58:56,358][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:58:57,075][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:58:57,791][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:58:58,548][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:56 [2026-03-26 00:58:59,671][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:58:59,674][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:58:59,676][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:59:00,990][__main__][INFO] - Iteration 664 took 1m 4s (7.78% Gen, 90.18% Train). Generation: 5s, Training: 58s. Estimated remaining time: 7h 26m 46s. Estimated total time: 18h 1m 13s. Time estimates for 10 more iterations: 10m 48s, 100 more iterations: 1h 48m 7s, 500 more iterations: 9h 0m 36s. [2026-03-26 00:59:00,993][__main__][INFO] - Starting iteration 664. [2026-03-26 00:59:00,996][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:59:00,997][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 00:59:06,219][__main__][INFO] - Number of regex retries in iteration 664: 0 [2026-03-26 00:59:06,220][__main__][INFO] - agents played in iteration 664 are Bob, Alice [2026-03-26 00:59:06,740][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:59:06,805][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 00:59:06,806][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 00:59:06,807][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 00:59:07,523][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 00:59:08,169][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 00:59:08,886][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 00:59:09,600][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 00:59:10,315][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 00:59:11,029][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 00:59:11,746][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 00:59:12,460][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 00:59:13,176][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 00:59:13,891][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 00:59:14,607][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 00:59:15,323][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 00:59:16,039][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 00:59:16,755][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 00:59:17,470][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 00:59:18,187][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 00:59:18,903][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 00:59:19,619][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 00:59:20,335][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 00:59:21,054][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 00:59:21,771][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 00:59:22,487][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 00:59:23,203][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 00:59:23,922][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 00:59:24,639][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 00:59:25,356][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 00:59:26,072][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 00:59:26,791][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 00:59:27,508][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 00:59:28,228][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 00:59:28,946][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 00:59:29,664][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 00:59:30,382][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 00:59:31,099][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 00:59:31,816][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 00:59:32,534][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 00:59:33,253][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 00:59:33,971][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 00:59:34,689][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 00:59:35,407][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 00:59:36,126][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 00:59:36,844][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 00:59:37,563][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 00:59:38,281][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 00:59:38,998][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 00:59:39,717][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 00:59:40,435][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 00:59:41,153][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 00:59:42,111][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 00:59:42,830][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 00:59:43,548][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 00:59:44,266][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
00:59:44,987][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 00:59:45,706][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 00:59:46,424][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 00:59:47,143][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 00:59:47,860][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 00:59:48,579][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 00:59:49,297][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 00:59:50,014][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 00:59:50,734][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 00:59:51,452][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 00:59:52,171][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 00:59:52,890][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 00:59:53,609][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 00:59:54,354][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 00:59:55,457][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 00:59:55,461][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 00:59:55,462][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 00:59:56,911][__main__][INFO] - Iteration 665 took 55s (9.34% Gen, 88.06% Train). Generation: 5s, Training: 49s. Estimated remaining time: 4h 56m 33s. Estimated total time: 15h 31m 56s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 11s, 500 more iterations: 7h 45m 58s. [2026-03-26 00:59:56,914][__main__][INFO] - Starting iteration 665. [2026-03-26 00:59:56,919][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 00:59:56,919][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:00:01,944][__main__][INFO] - Number of regex retries in iteration 665: 0 [2026-03-26 01:00:01,945][__main__][INFO] - agents played in iteration 665 are Bob, Alice [2026-03-26 01:00:02,451][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:00:02,518][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:00:02,518][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:00:02,519][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:00:03,198][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:00:03,842][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:00:04,560][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:00:05,276][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:00:05,990][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:00:06,708][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:00:07,423][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:00:08,138][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:00:08,856][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:00:09,572][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:00:10,287][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:00:11,003][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:00:11,719][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:00:12,434][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:00:13,151][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:00:13,866][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:00:14,582][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:00:15,299][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:00:16,015][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:00:16,731][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:00:17,449][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:00:18,165][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:00:18,882][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:00:19,598][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:00:20,315][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:00:21,031][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:00:21,749][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:00:22,465][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:00:23,184][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:00:23,899][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:00:24,618][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:00:25,335][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:00:26,050][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:00:26,768][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:00:27,485][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:00:28,204][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:00:28,921][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:00:29,639][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:00:30,356][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:00:31,075][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:00:31,791][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:00:32,508][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:00:33,224][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:00:33,943][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:00:34,659][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:00:35,378][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:00:36,095][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:00:36,812][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:00:37,757][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:00:38,476][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:00:39,195][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:00:39,911][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:00:40,629][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:00:41,345][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:00:42,062][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:00:42,778][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:00:43,496][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:00:44,216][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:00:44,931][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:00:45,650][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:00:46,367][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:00:47,086][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:00:47,803][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:00:48,522][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:00:49,239][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:00:49,986][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:00:51,374][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:00:51,378][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:00:51,380][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:00:52,739][__main__][INFO] - Iteration 666 took 55s (9.00% Gen, 88.56% Train). Generation: 5s, Training: 49s. Estimated remaining time: 4h 54m 4s. Estimated total time: 15h 30m 22s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 2s, 500 more iterations: 7h 45m 11s. [2026-03-26 01:00:52,742][__main__][INFO] - Starting iteration 666. [2026-03-26 01:00:52,746][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 01:00:52,747][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:00:57,846][__main__][INFO] - Number of regex retries in iteration 666: 0 [2026-03-26 01:00:57,848][__main__][INFO] - agents played in iteration 666 are Bob, Alice [2026-03-26 01:00:58,372][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:00:58,438][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:00:58,439][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:00:58,440][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:00:59,149][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:00:59,795][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:01:00,512][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:01:01,228][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:01:01,944][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:01:02,658][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:01:03,376][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:01:04,090][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:01:04,808][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:01:05,523][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:01:06,241][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:01:06,956][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:01:07,674][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:01:08,391][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:01:09,108][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:01:09,824][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:01:10,542][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:01:11,257][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:01:11,975][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:01:12,691][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:01:13,409][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:01:14,124][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:01:14,841][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:01:15,557][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:01:16,275][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:01:16,991][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:01:17,707][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:01:18,425][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:01:19,141][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:01:19,860][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:01:20,577][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:01:21,297][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:01:22,014][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:01:22,732][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:01:23,450][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:01:24,166][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:01:24,883][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:01:25,600][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:01:26,317][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:01:27,033][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:01:27,750][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:01:28,467][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:01:29,184][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:01:29,899][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:01:30,618][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:01:31,335][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:01:32,052][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:01:32,768][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:01:33,761][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:01:34,478][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:01:35,194][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:01:35,912][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:01:36,628][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:01:37,347][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:01:38,064][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:01:38,783][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:01:39,499][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:01:40,218][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:01:40,934][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:01:41,652][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:01:42,368][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:01:43,087][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:01:43,803][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:01:44,520][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:01:45,238][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:01:45,973][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:01:47,199][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:01:47,203][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:01:47,206][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:01:48,454][__main__][INFO] - Iteration 667 took 55s (9.16% Gen, 88.60% Train). Generation: 5s, Training: 49s. Estimated remaining time: 4h 51m 15s. Estimated total time: 15h 28m 29s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 50s, 500 more iterations: 7h 44m 14s. [2026-03-26 01:01:48,457][__main__][INFO] - Starting iteration 667. [2026-03-26 01:01:48,462][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 01:01:48,463][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:01:53,669][__main__][INFO] - Number of regex retries in iteration 667: 0 [2026-03-26 01:01:53,670][__main__][INFO] - agents played in iteration 667 are Bob, Alice [2026-03-26 01:01:54,172][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:01:54,237][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:01:54,238][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:01:54,239][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:01:54,924][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:01:55,569][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:01:56,288][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:01:57,003][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:01:57,719][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:01:58,436][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:01:59,151][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:01:59,868][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:02:00,583][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:02:01,299][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:02:02,014][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:02:02,732][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:02:03,447][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:02:04,164][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:02:04,879][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:02:05,597][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:02:06,312][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:02:07,028][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:02:07,745][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:02:08,461][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:02:09,178][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:02:09,897][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:02:10,614][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:02:11,331][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:02:12,047][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:02:12,764][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:02:13,482][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:02:14,198][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:02:14,918][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:02:15,634][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:02:16,352][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:02:17,069][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:02:17,787][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:02:18,504][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:02:19,221][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:02:19,939][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:02:20,655][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:02:21,374][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:02:22,090][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:02:22,807][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:02:23,523][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:02:24,242][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:02:24,958][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:02:25,676][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:02:26,392][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:02:27,108][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:02:27,825][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:02:28,541][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:02:29,483][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:02:30,200][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:02:30,918][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:02:31,635][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:02:32,354][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:02:33,069][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:02:33,786][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:02:34,502][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:02:35,220][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:02:35,937][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:02:36,654][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:02:37,371][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:02:38,088][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:02:38,806][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:02:39,523][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:02:40,242][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:02:40,960][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:02:41,678][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:02:42,895][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:02:42,900][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:02:42,902][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:02:44,270][__main__][INFO] - Iteration 668 took 55s (9.33% Gen, 88.22% Train). Generation: 5s, Training: 49s. Estimated remaining time: 4h 51m 59s. Estimated total time: 15h 30m 9s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 0s, 500 more iterations: 7h 45m 4s. [2026-03-26 01:02:44,272][__main__][INFO] - Starting iteration 668. [2026-03-26 01:02:44,275][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 01:02:44,276][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:02:49,241][__main__][INFO] - Number of regex retries in iteration 668: 0 [2026-03-26 01:02:49,242][__main__][INFO] - agents played in iteration 668 are Bob, Alice [2026-03-26 01:02:49,824][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:02:49,889][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:02:49,890][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:02:49,891][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:02:50,574][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:02:51,220][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:02:51,936][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:02:52,651][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:02:53,367][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:02:54,081][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:02:54,798][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:02:55,514][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:02:56,228][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:02:56,946][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:02:57,661][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:02:58,379][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:02:59,094][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:02:59,811][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:03:00,527][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:03:01,242][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:03:01,961][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:03:02,677][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:03:03,394][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:03:04,110][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:03:04,828][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:03:05,544][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:03:06,264][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:03:06,981][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:03:07,698][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:03:08,415][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:03:09,133][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:03:09,852][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:03:10,569][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:03:11,287][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:03:12,003][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:03:12,722][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:03:13,437][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:03:14,155][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:03:14,870][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:03:15,588][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:03:16,303][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:03:17,020][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:03:17,736][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:03:18,454][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:03:19,170][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:03:19,888][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:03:20,604][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:03:21,320][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:03:22,037][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:03:22,755][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:03:23,472][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:03:24,189][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:03:25,144][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:03:25,860][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:03:26,576][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:03:27,293][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:03:28,012][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:03:28,728][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:03:29,446][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:03:30,163][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:03:30,883][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:03:31,599][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:03:32,316][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:03:33,034][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:03:33,750][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:03:34,470][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:03:35,187][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:03:35,906][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:03:36,622][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:03:37,348][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:03:38,437][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:03:38,440][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:03:38,441][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:03:39,814][__main__][INFO] - Iteration 669 took 55s (8.94% Gen, 88.58% Train). Generation: 4s, Training: 49s. Estimated remaining time: 4h 46m 34s. Estimated total time: 15h 25m 40s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 34s, 500 more iterations: 7h 42m 50s. [2026-03-26 01:03:39,818][__main__][INFO] - Starting iteration 669. [2026-03-26 01:03:39,823][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 01:03:39,824][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:03:44,732][__main__][INFO] - Number of regex retries in iteration 669: 0 [2026-03-26 01:03:44,733][__main__][INFO] - agents played in iteration 669 are Bob, Alice [2026-03-26 01:03:45,275][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:03:45,338][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:03:45,339][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:03:45,340][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:03:46,021][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:03:46,667][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:03:47,387][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:03:48,102][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:03:48,819][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:03:49,534][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:03:50,251][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:03:50,966][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:03:51,682][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:03:52,398][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:03:53,113][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:03:53,831][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:03:54,547][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:03:55,265][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:03:55,981][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:03:56,699][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:03:57,416][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:03:58,135][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:03:58,850][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:03:59,569][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:04:00,286][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:04:01,004][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:04:01,720][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:04:02,440][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:04:03,156][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:04:03,876][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:04:04,592][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:04:05,310][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:04:06,027][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:04:06,745][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:04:07,462][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:04:08,179][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:04:08,897][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:04:09,615][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:04:10,334][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:04:11,050][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:04:11,767][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:04:12,482][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:04:13,200][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:04:13,917][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:04:14,635][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:04:15,351][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:04:16,068][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:04:16,785][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:04:17,501][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:04:18,218][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:04:18,935][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:04:19,651][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:04:20,673][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:04:21,392][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:04:22,109][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:04:22,826][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:04:23,542][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:04:24,259][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:04:24,977][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:04:25,693][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:04:26,411][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:04:27,130][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:04:27,848][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:04:28,564][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:04:29,284][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:04:29,999][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:04:30,716][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:04:31,434][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:04:32,150][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:04:32,883][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:04:34,158][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:04:34,162][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:04:34,164][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:04:35,491][__main__][INFO] - Iteration 670 took 55s (8.82% Gen, 88.79% Train). Generation: 4s, Training: 49s. Estimated remaining time: 4h 47m 48s. Estimated total time: 15h 27m 49s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 46s, 500 more iterations: 7h 43m 54s. [2026-03-26 01:04:35,494][__main__][INFO] - Starting iteration 670. [2026-03-26 01:04:35,497][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 01:04:35,498][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:04:40,453][__main__][INFO] - Number of regex retries in iteration 670: 0 [2026-03-26 01:04:40,454][__main__][INFO] - agents played in iteration 670 are Bob, Alice [2026-03-26 01:04:40,965][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:04:41,030][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:04:41,032][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:04:41,033][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:04:41,719][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:04:42,365][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:04:43,083][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:04:43,798][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:04:44,514][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:04:45,229][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:04:45,947][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:04:46,662][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:04:47,379][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:04:48,095][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:04:48,812][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:04:49,527][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:04:50,244][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:04:50,960][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:04:51,677][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:04:52,393][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:04:53,110][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:04:53,826][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:04:54,545][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:04:55,261][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:04:55,980][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:04:56,695][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:04:57,414][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:04:58,130][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:04:58,846][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:04:59,563][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:05:00,281][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:05:00,996][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:05:01,715][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:05:02,431][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:05:03,150][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:05:03,866][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:05:04,585][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:05:05,299][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:05:06,017][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:05:06,733][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:05:07,451][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:05:08,166][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:05:08,884][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:05:09,599][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:05:10,318][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:05:11,033][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:05:11,749][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:05:12,466][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:05:13,182][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:05:13,899][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:05:14,615][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:05:15,333][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:05:16,275][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:05:16,994][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:05:17,709][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:05:18,427][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:05:19,144][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:05:19,862][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:05:20,579][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:05:21,295][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:05:22,012][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:05:22,729][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:05:23,446][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:05:24,163][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:05:24,881][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:05:25,596][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:05:26,313][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:05:27,032][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:05:27,749][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:05:28,480][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:05:29,683][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:05:29,687][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:05:29,692][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:05:31,176][__main__][INFO] - Iteration 671 took 55s (8.90% Gen, 88.43% Train). Generation: 4s, Training: 49s. Estimated remaining time: 4h 47m 3s. Estimated total time: 15h 28m 0s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 48s, 500 more iterations: 7h 44m 0s. [2026-03-26 01:05:31,180][__main__][INFO] - Starting iteration 671. [2026-03-26 01:05:31,187][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 01:05:31,188][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:05:36,141][__main__][INFO] - Number of regex retries in iteration 671: 0 [2026-03-26 01:05:36,142][__main__][INFO] - agents played in iteration 671 are Bob, Alice [2026-03-26 01:05:36,636][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:05:36,700][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:05:36,701][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:05:36,702][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:05:37,388][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:05:38,033][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:05:38,751][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:05:39,466][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:05:40,182][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:05:40,898][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:05:41,613][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:05:42,328][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:05:43,044][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:05:43,760][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:05:44,476][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:05:45,192][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:05:45,909][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:05:46,625][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:05:47,343][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:05:48,059][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:05:48,777][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:05:49,493][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:05:50,211][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:05:50,927][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:05:51,642][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:05:52,359][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:05:53,077][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:05:53,794][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:05:54,510][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:05:55,229][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:05:55,945][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:05:56,663][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:05:57,380][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:05:58,098][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:05:58,815][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:05:59,533][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:06:00,247][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:06:00,964][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:06:01,680][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:06:02,398][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:06:03,113][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:06:03,832][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:06:04,547][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:06:05,264][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:06:05,980][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:06:06,697][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:06:07,414][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:06:08,129][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:06:08,846][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:06:09,562][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:06:10,279][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:06:10,995][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:06:11,940][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:06:12,657][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:06:13,373][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:06:14,089][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:06:14,806][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:06:15,524][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:06:16,240][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:06:16,958][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:06:17,676][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:06:18,393][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:06:19,111][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:06:19,827][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:06:20,547][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:06:21,265][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:06:21,983][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:06:22,700][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:06:23,419][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:06:24,143][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:06:25,310][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:06:25,317][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:06:25,319][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:06:26,718][__main__][INFO] - Iteration 672 took 55s (8.92% Gen, 88.55% Train). Generation: 4s, Training: 49s. Estimated remaining time: 4h 43m 41s. Estimated total time: 15h 25m 34s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 33s, 500 more iterations: 7h 42m 47s. [2026-03-26 01:06:26,721][__main__][INFO] - Starting iteration 672. [2026-03-26 01:06:26,725][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 01:06:26,725][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:06:32,232][__main__][INFO] - Number of regex retries in iteration 672: 0 [2026-03-26 01:06:32,233][__main__][INFO] - agents played in iteration 672 are Bob, Alice [2026-03-26 01:06:32,732][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:06:32,838][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:06:32,839][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:06:32,840][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:06:33,525][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:06:34,169][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:06:34,888][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:06:35,605][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:06:36,319][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:06:37,034][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:06:37,750][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:06:38,465][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:06:39,181][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:06:39,896][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:06:40,613][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:06:41,328][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:06:42,044][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:06:42,759][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:06:43,475][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:06:44,191][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:06:44,908][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:06:45,624][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:06:46,343][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:06:47,058][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:06:47,776][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:06:48,493][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:06:49,210][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:06:49,926][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:06:50,643][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:06:51,360][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:06:52,076][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:06:52,794][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:06:53,511][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:06:54,227][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:06:54,942][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:06:55,659][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:06:56,373][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:06:57,090][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:06:57,806][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:06:58,522][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:06:59,238][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:06:59,953][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:07:00,671][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:07:01,388][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:07:02,105][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:07:02,822][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:07:03,538][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:07:04,254][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:07:04,971][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:07:05,687][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:07:06,403][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:07:07,120][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:07:08,159][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:07:08,876][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:07:09,592][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:07:10,310][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:07:11,027][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:07:11,743][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:07:12,460][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:07:13,177][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:07:13,894][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:07:14,611][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:07:15,329][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:07:16,046][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:07:16,763][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:07:17,479][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:07:18,198][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:07:18,915][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:07:19,633][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:07:20,372][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:07:21,634][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:07:21,638][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:07:21,640][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:07:23,136][__main__][INFO] - Iteration 673 took 56s (9.76% Gen, 87.58% Train). Generation: 5s, Training: 49s. Estimated remaining time: 4h 57m 24s. Estimated total time: 15h 40m 13s. Time estimates for 10 more iterations: 9m 24s, 100 more iterations: 1h 34m 1s, 500 more iterations: 7h 50m 6s. [2026-03-26 01:07:23,138][__main__][INFO] - Starting iteration 673. [2026-03-26 01:07:23,143][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 01:07:23,144][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:07:28,058][__main__][INFO] - Number of regex retries in iteration 673: 0 [2026-03-26 01:07:28,059][__main__][INFO] - agents played in iteration 673 are Bob, Alice [2026-03-26 01:07:28,559][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:07:28,624][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:07:28,625][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:07:28,625][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:07:29,308][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:07:29,956][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:07:30,672][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:07:31,388][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:07:32,103][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:07:32,819][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:07:33,534][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:07:34,252][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:07:34,966][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:07:35,684][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:07:36,399][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:07:37,116][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:07:37,832][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:07:38,550][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:07:39,266][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:07:39,984][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:07:40,700][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:07:41,417][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:07:42,133][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:07:42,849][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:07:43,566][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:07:44,283][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:07:45,000][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:07:45,717][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:07:46,435][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:07:47,150][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:07:47,866][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:07:48,583][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:07:49,300][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:07:50,015][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:07:50,731][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:07:51,446][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:07:52,163][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:07:52,880][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:07:53,595][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:07:54,312][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:07:55,027][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:07:55,744][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:07:56,460][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:07:57,177][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:07:57,893][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:07:58,610][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:07:59,326][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:08:00,043][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:08:00,760][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:08:01,476][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:08:02,192][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:08:02,910][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:08:03,853][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:08:04,569][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:08:05,286][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:08:06,003][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:08:06,720][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:08:07,438][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:08:08,155][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:08:08,873][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:08:09,592][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:08:10,310][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:08:11,026][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:08:11,745][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:08:12,462][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:08:13,179][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:08:13,897][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:08:14,614][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:08:15,332][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:08:16,062][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:08:17,441][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:08:17,446][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:08:17,449][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:08:18,865][__main__][INFO] - Iteration 674 took 55s (8.82% Gen, 88.63% Train). Generation: 4s, Training: 49s. Estimated remaining time: 4h 45m 0s. Estimated total time: 15h 28m 44s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 52s, 500 more iterations: 7h 44m 22s. [2026-03-26 01:08:18,868][__main__][INFO] - Starting iteration 674. [2026-03-26 01:08:18,893][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 01:08:18,894][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:08:26,716][__main__][INFO] - Number of regex retries in iteration 674: 0 [2026-03-26 01:08:26,717][__main__][INFO] - agents played in iteration 674 are Bob, Alice [2026-03-26 01:08:27,214][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:08:27,280][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:08:27,281][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:08:27,281][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:08:27,967][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:08:28,611][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:08:29,329][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:08:30,041][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:08:30,756][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:08:31,470][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:08:32,185][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:08:32,899][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:08:33,613][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:08:34,329][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:08:35,044][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:08:35,757][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:08:36,474][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:08:37,189][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:08:37,904][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:08:38,622][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:08:39,337][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:08:40,055][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:08:40,769][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:08:41,486][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:08:42,201][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:08:42,918][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:08:43,635][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:08:44,352][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:08:45,067][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:08:45,784][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:08:46,500][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:08:47,219][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:08:47,933][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:08:48,652][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:08:49,367][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:08:50,084][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:08:50,800][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:08:51,518][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:08:52,233][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:08:52,950][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:08:53,666][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:08:54,384][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:08:55,100][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:08:55,817][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:08:56,532][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:08:57,249][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:08:57,964][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:08:58,681][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:08:59,396][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:09:00,113][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:09:00,828][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:09:01,545][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:09:02,489][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:09:03,205][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:09:03,923][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:09:04,637][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:09:05,356][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:09:06,071][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:09:06,789][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:09:07,505][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:09:08,223][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:09:08,940][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:09:09,657][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:09:10,374][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:09:11,091][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:09:11,807][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:09:12,525][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:09:13,241][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:09:13,959][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:09:14,695][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:09:15,806][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:09:15,810][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:09:15,811][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:09:17,135][__main__][INFO] - Iteration 675 took 58s (13.43% Gen, 84.29% Train). Generation: 7s, Training: 49s. Estimated remaining time: 5h 26m 2s. Estimated total time: 16h 10m 45s. Time estimates for 10 more iterations: 9m 42s, 100 more iterations: 1h 37m 4s, 500 more iterations: 8h 5m 22s. [2026-03-26 01:09:17,141][__main__][INFO] - Starting iteration 675. [2026-03-26 01:09:17,147][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 01:09:17,149][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:09:22,111][__main__][INFO] - Number of regex retries in iteration 675: 0 [2026-03-26 01:09:22,113][__main__][INFO] - agents played in iteration 675 are Bob, Alice [2026-03-26 01:09:22,640][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:09:22,705][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:09:22,705][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:09:22,706][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:09:23,394][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:09:24,039][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:09:24,757][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:09:25,471][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:09:26,186][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:09:26,900][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:09:27,616][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:09:28,332][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:09:29,047][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:09:29,762][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:09:30,479][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:09:31,194][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:09:31,910][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:09:32,625][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:09:33,342][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:09:34,057][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:09:34,774][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:09:35,490][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:09:36,208][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:09:36,923][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:09:37,642][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:09:38,356][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:09:39,076][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:09:39,792][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:09:40,508][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:09:41,227][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:09:41,942][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:09:42,658][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:09:43,374][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:09:44,090][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:09:44,806][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:09:45,521][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:09:46,238][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:09:46,954][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:09:47,672][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:09:48,389][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:09:49,105][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:09:49,822][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:09:50,540][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:09:51,255][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:09:51,975][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:09:52,690][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:09:53,410][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:09:54,125][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:09:54,842][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:09:55,557][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:09:56,275][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:09:56,991][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:09:58,017][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:09:58,735][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:09:59,452][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:10:00,169][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:10:00,886][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:10:01,603][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:10:02,321][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:10:03,037][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:10:03,756][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:10:04,472][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:10:05,191][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:10:05,907][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:10:06,626][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:10:07,343][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:10:08,061][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:10:08,780][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:10:09,499][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:10:10,256][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:10:11,579][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:10:11,583][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:10:11,585][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:10:13,243][__main__][INFO] - Iteration 676 took 56s (8.85% Gen, 88.19% Train). Generation: 4s, Training: 49s. Estimated remaining time: 4h 49m 20s. Estimated total time: 15h 34m 59s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 29s, 500 more iterations: 7h 47m 29s. [2026-03-26 01:10:13,247][__main__][INFO] - Starting iteration 676. [2026-03-26 01:10:13,251][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 01:10:13,252][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:10:19,460][__main__][INFO] - Number of regex retries in iteration 676: 0 [2026-03-26 01:10:19,461][__main__][INFO] - agents played in iteration 676 are Bob, Alice [2026-03-26 01:10:20,052][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:10:20,117][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:10:20,118][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:10:20,118][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:10:20,799][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:10:21,443][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:10:22,161][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:10:22,875][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:10:23,590][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:10:24,304][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:10:25,021][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:10:25,734][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:10:26,451][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:10:27,165][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:10:27,883][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:10:28,598][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:10:29,315][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:10:30,031][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:10:30,749][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:10:31,464][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:10:32,182][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:10:32,897][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:10:33,614][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:10:34,330][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:10:35,047][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:10:35,764][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:10:36,481][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:10:37,198][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:10:37,914][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:10:38,632][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:10:39,347][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:10:40,066][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:10:40,781][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:10:41,498][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:10:42,213][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:10:42,931][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:10:43,645][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:10:44,361][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:10:45,078][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:10:45,794][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:10:46,511][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:10:47,226][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:10:47,945][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:10:48,661][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:10:49,377][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:10:50,094][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:10:50,810][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:10:51,528][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:10:52,245][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:10:52,962][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:10:53,679][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:10:54,398][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:10:55,357][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:10:56,073][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:10:56,790][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:10:57,509][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:10:58,225][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:10:58,942][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:10:59,661][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:11:00,377][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:11:01,096][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:11:01,814][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:11:02,531][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:11:03,249][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:11:03,966][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:11:04,685][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:11:05,401][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:11:06,120][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:11:06,838][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:11:07,572][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:11:08,722][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:11:08,725][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:11:08,728][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:11:11,928][__main__][INFO] - Iteration 677 took 58s (10.58% Gen, 83.96% Train). Generation: 6s, Training: 49s. Estimated remaining time: 5h 31m 21s. Estimated total time: 16h 17m 58s. Time estimates for 10 more iterations: 9m 46s, 100 more iterations: 1h 37m 47s, 500 more iterations: 8h 8m 59s. [2026-03-26 01:11:11,931][__main__][INFO] - Starting iteration 677. [2026-03-26 01:11:11,935][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 01:11:11,936][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:11:17,041][__main__][INFO] - Number of regex retries in iteration 677: 0 [2026-03-26 01:11:17,043][__main__][INFO] - agents played in iteration 677 are Bob, Alice [2026-03-26 01:11:17,568][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:11:17,632][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:11:17,633][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:11:17,634][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:11:18,320][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:11:18,966][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:11:19,682][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:11:20,397][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:11:21,111][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:11:21,827][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:11:22,541][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:11:23,256][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:11:23,973][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:11:24,687][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:11:25,405][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:11:26,120][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:11:26,836][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:11:27,554][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:11:28,271][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:11:28,987][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:11:29,705][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:11:30,422][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:11:31,139][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:11:31,856][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:11:32,572][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:11:33,289][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:11:34,006][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:11:34,723][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:11:35,438][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:11:36,158][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:11:36,874][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:11:37,591][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:11:38,308][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:11:39,025][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:11:39,742][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:11:40,459][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:11:41,174][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:11:41,892][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:11:42,608][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:11:43,326][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:11:44,042][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:11:44,759][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:11:45,476][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:11:46,193][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:11:46,910][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:11:47,627][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:11:48,343][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:11:49,061][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:11:49,776][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:11:50,494][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:11:51,209][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:11:51,926][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:11:52,871][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:11:53,589][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:11:54,306][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:11:55,023][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:11:55,739][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:11:56,456][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:11:57,174][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:11:57,890][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:11:58,608][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:11:59,325][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:12:00,042][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:12:00,759][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:12:01,476][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:12:02,194][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:12:02,913][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:12:03,628][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:12:04,346][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:12:05,075][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:12:06,354][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:12:06,359][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:12:06,360][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:12:07,716][__main__][INFO] - Iteration 678 took 55s (9.16% Gen, 88.41% Train). Generation: 5s, Training: 49s. Estimated remaining time: 4h 42m 10s. Estimated total time: 15h 29m 43s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 58s, 500 more iterations: 7h 44m 51s. [2026-03-26 01:12:07,719][__main__][INFO] - Starting iteration 678. [2026-03-26 01:12:07,724][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 01:12:07,724][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:12:12,640][__main__][INFO] - Number of regex retries in iteration 678: 0 [2026-03-26 01:12:12,642][__main__][INFO] - agents played in iteration 678 are Bob, Alice [2026-03-26 01:12:13,143][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:12:13,209][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:12:13,209][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:12:13,210][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:12:13,894][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:12:14,538][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:12:15,255][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:12:15,971][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:12:16,684][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:12:17,402][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:12:18,117][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:12:18,832][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:12:19,549][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:12:20,264][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:12:20,981][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:12:21,696][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:12:22,414][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:12:23,130][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:12:23,848][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:12:24,564][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:12:25,282][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:12:25,997][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:12:26,715][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:12:27,432][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:12:28,149][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:12:28,864][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:12:29,581][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:12:30,296][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:12:31,014][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:12:31,730][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:12:32,447][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:12:33,164][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:12:33,880][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:12:34,595][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:12:35,312][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:12:36,029][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:12:36,746][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:12:37,462][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:12:38,179][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:12:38,895][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:12:39,613][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:12:40,329][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:12:41,046][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:12:41,764][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:12:42,480][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:12:43,198][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:12:43,914][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:12:44,631][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:12:45,348][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:12:46,064][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:12:46,780][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:12:47,499][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:12:48,501][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:12:49,220][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:12:49,936][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:12:50,655][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:12:51,373][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:12:52,090][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:12:52,807][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:12:53,525][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:12:54,241][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:12:54,960][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:12:55,677][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:12:56,395][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:12:57,112][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:12:57,829][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:12:58,547][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:12:59,264][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:12:59,981][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:13:00,726][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:13:01,747][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:13:01,750][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:13:01,751][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:13:03,082][__main__][INFO] - Iteration 679 took 55s (8.88% Gen, 88.71% Train). Generation: 4s, Training: 49s. Estimated remaining time: 4h 34m 11s. Estimated total time: 15h 22m 39s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 15s, 500 more iterations: 7h 41m 19s. [2026-03-26 01:13:03,086][__main__][INFO] - Starting iteration 679. [2026-03-26 01:13:03,092][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 01:13:03,094][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:13:08,001][__main__][INFO] - Number of regex retries in iteration 679: 0 [2026-03-26 01:13:08,002][__main__][INFO] - agents played in iteration 679 are Bob, Alice [2026-03-26 01:13:08,500][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:13:08,566][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:13:08,567][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:13:08,567][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:13:09,256][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:13:09,905][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:13:10,622][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:13:11,338][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:13:12,052][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:13:12,770][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:13:13,485][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:13:14,202][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:13:14,917][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:13:15,635][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:13:16,350][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:13:17,066][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:13:17,783][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:13:18,499][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:13:19,215][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:13:19,933][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:13:20,648][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:13:21,364][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:13:22,079][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:13:22,795][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:13:23,511][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:13:24,227][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:13:24,942][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:13:25,659][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:13:26,375][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:13:27,090][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:13:27,809][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:13:28,524][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:13:29,242][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:13:29,959][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:13:30,675][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:13:31,393][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:13:32,107][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:13:32,825][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:13:33,540][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:13:34,257][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:13:34,974][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:13:35,690][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:13:36,407][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:13:37,125][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:13:37,841][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:13:38,558][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:13:39,276][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:13:39,994][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:13:40,710][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:13:41,429][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:13:42,144][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:13:42,860][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:13:43,834][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:13:44,551][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:13:45,267][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:13:45,985][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:13:46,701][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:13:47,418][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:13:48,135][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:13:48,851][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:13:49,568][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:13:50,285][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:13:51,004][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:13:51,723][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:13:52,441][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:13:53,159][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:13:53,878][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:13:54,594][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:13:55,311][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:13:56,039][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:13:57,124][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:13:57,129][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:13:57,131][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:13:58,687][__main__][INFO] - Iteration 680 took 55s (8.83% Gen, 88.37% Train). Generation: 4s, Training: 49s. Estimated remaining time: 4h 37m 13s. Estimated total time: 15h 26m 38s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 39s, 500 more iterations: 7h 43m 19s. [2026-03-26 01:13:58,690][__main__][INFO] - Starting iteration 680. [2026-03-26 01:13:58,694][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 01:13:58,694][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:14:03,774][__main__][INFO] - Number of regex retries in iteration 680: 0 [2026-03-26 01:14:03,775][__main__][INFO] - agents played in iteration 680 are Bob, Alice [2026-03-26 01:14:04,521][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:14:04,587][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:14:04,589][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:14:04,590][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:14:05,280][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:14:05,926][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:14:06,642][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:14:07,360][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:14:08,075][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:14:08,794][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:14:09,509][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:14:10,227][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:14:10,942][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:14:11,658][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:14:12,375][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:14:13,090][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:14:13,810][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:14:14,526][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:14:15,243][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:14:15,960][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:14:16,678][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:14:17,393][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:14:18,112][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:14:18,828][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:14:19,545][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:14:20,262][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:14:20,980][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:14:21,696][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:14:22,413][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:14:23,128][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:14:23,846][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:14:24,561][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:14:25,277][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:14:25,994][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:14:26,711][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:14:27,427][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:14:28,146][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:14:28,862][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:14:29,579][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:14:30,296][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:14:31,012][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:14:31,728][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:14:32,445][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:14:33,162][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:14:33,878][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:14:34,595][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:14:35,311][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:14:36,029][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:14:36,745][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:14:37,463][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:14:38,180][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:14:38,898][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:14:39,848][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:14:40,565][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:14:41,282][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:14:41,999][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:14:42,717][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:14:43,436][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:14:44,152][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:14:44,871][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:14:45,587][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:14:46,305][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:14:47,023][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:14:47,739][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:14:48,457][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:14:49,173][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:14:49,891][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:14:50,607][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:14:51,326][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:14:52,058][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:14:53,069][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:14:53,072][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:14:53,073][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:14:54,504][__main__][INFO] - Iteration 681 took 55s (9.10% Gen, 88.33% Train). Generation: 5s, Training: 49s. Estimated remaining time: 4h 39m 51s. Estimated total time: 15h 30m 11s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 1s, 500 more iterations: 7h 45m 5s. [2026-03-26 01:14:54,507][__main__][INFO] - Starting iteration 681. [2026-03-26 01:14:54,511][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 01:14:54,512][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:14:59,720][__main__][INFO] - Number of regex retries in iteration 681: 0 [2026-03-26 01:14:59,720][__main__][INFO] - agents played in iteration 681 are Bob, Alice [2026-03-26 01:15:00,223][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:15:00,289][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:15:00,290][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:15:00,291][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:15:00,994][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:15:01,657][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:15:02,374][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:15:03,092][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:15:03,807][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:15:04,524][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:15:05,239][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:15:05,960][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:15:06,677][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:15:07,393][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:15:08,109][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:15:08,824][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:15:09,541][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:15:10,258][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:15:10,975][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:15:11,691][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:15:12,410][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:15:13,126][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:15:13,844][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:15:14,560][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:15:15,279][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:15:15,994][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:15:16,714][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:15:17,430][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:15:18,149][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:15:18,867][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:15:19,583][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:15:20,301][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:15:21,016][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:15:21,733][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:15:22,449][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:15:23,167][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:15:23,881][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:15:24,598][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:15:25,314][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:15:26,030][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:15:26,747][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:15:27,464][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:15:28,180][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:15:28,898][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:15:29,612][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:15:30,331][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:15:31,047][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:15:31,763][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:15:32,480][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:15:33,197][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:15:33,913][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:15:34,630][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:15:35,593][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:15:36,310][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:15:37,028][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:15:37,743][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:15:38,461][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:15:39,178][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:15:39,895][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:15:40,611][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:15:41,329][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:15:42,047][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:15:42,766][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:15:43,484][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:15:44,201][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:15:44,918][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:15:45,635][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:15:46,353][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:15:47,071][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:15:47,854][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:15:49,109][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:15:49,114][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:15:49,116][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:15:50,419][__main__][INFO] - Iteration 682 took 55s (9.32% Gen, 88.35% Train). Generation: 5s, Training: 49s. Estimated remaining time: 4h 40m 33s. Estimated total time: 15h 31m 50s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 11s, 500 more iterations: 7h 45m 55s. [2026-03-26 01:15:50,423][__main__][INFO] - Starting iteration 682. [2026-03-26 01:15:50,429][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 01:15:50,430][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:15:55,371][__main__][INFO] - Number of regex retries in iteration 682: 0 [2026-03-26 01:15:55,372][__main__][INFO] - agents played in iteration 682 are Bob, Alice [2026-03-26 01:15:55,875][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:15:55,940][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:15:55,941][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:15:55,942][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:15:56,657][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:15:57,303][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:15:58,021][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:15:58,736][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:15:59,452][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:16:00,166][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:16:00,884][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:16:01,598][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:16:02,316][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:16:03,030][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:16:03,748][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:16:04,464][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:16:05,181][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:16:05,897][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:16:06,613][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:16:07,329][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:16:08,048][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:16:08,763][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:16:09,481][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:16:10,197][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:16:10,914][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:16:11,630][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:16:12,348][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:16:13,064][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:16:13,781][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:16:14,500][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:16:15,216][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:16:15,933][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:16:16,649][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:16:17,364][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:16:18,080][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:16:18,796][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:16:19,515][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:16:20,230][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:16:20,947][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:16:21,662][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:16:22,380][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:16:23,096][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:16:23,812][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:16:24,531][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:16:25,247][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:16:25,965][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:16:26,679][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:16:27,399][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:16:28,116][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:16:28,832][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:16:29,548][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:16:30,265][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:16:31,245][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:16:31,964][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:16:32,681][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:16:33,399][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:16:34,116][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:16:34,833][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:16:35,550][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:16:36,270][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:16:36,985][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:16:37,704][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:16:38,420][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:16:39,139][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:16:39,856][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:16:40,573][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:16:41,292][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:16:42,008][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:16:42,727][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:16:43,450][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:16:44,468][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:16:44,472][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:16:44,474][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:16:45,769][__main__][INFO] - Iteration 683 took 55s (8.93% Gen, 88.72% Train). Generation: 4s, Training: 49s. Estimated remaining time: 4h 30m 10s. Estimated total time: 15h 22m 22s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 14s, 500 more iterations: 7h 41m 11s. [2026-03-26 01:16:45,771][__main__][INFO] - Starting iteration 683. [2026-03-26 01:16:45,777][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 01:16:45,778][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:16:50,948][__main__][INFO] - Number of regex retries in iteration 683: 0 [2026-03-26 01:16:50,949][__main__][INFO] - agents played in iteration 683 are Bob, Alice [2026-03-26 01:16:51,519][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:16:51,583][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:16:51,584][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:16:51,585][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:16:52,262][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:16:52,910][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:16:53,627][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:16:54,342][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:16:55,058][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:16:55,772][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:16:56,489][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:16:57,204][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:16:57,920][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:16:58,636][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:16:59,353][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:17:00,068][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:17:00,787][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:17:01,502][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:17:02,219][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:17:02,936][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:17:03,653][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:17:04,369][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:17:05,088][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:17:05,806][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:17:06,522][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:17:07,240][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:17:07,957][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:17:08,673][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:17:09,393][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:17:10,108][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:17:10,825][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:17:11,540][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:17:12,257][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:17:12,973][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:17:13,690][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:17:14,406][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:17:15,125][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:17:15,841][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:17:16,558][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:17:17,274][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:17:17,991][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:17:18,707][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:17:19,424][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:17:20,141][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:17:20,857][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:17:21,574][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:17:22,290][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:17:23,008][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:17:23,724][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:17:24,442][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:17:25,159][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:17:25,877][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:17:26,825][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:17:27,541][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:17:28,259][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:17:28,974][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:17:29,691][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:17:30,407][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:17:31,125][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:17:31,843][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:17:32,559][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:17:33,280][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:17:33,996][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:17:34,714][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:17:35,432][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:17:36,150][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:17:36,867][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:17:37,586][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:17:38,303][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:17:39,040][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:17:40,017][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:17:40,019][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:17:40,020][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:17:41,410][__main__][INFO] - Iteration 684 took 55s (9.29% Gen, 88.20% Train). Generation: 5s, Training: 49s. Estimated remaining time: 4h 34m 8s. Estimated total time: 15h 27m 15s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 43s, 500 more iterations: 7h 43m 37s. [2026-03-26 01:17:41,414][__main__][INFO] - Starting iteration 684. [2026-03-26 01:17:41,420][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 01:17:41,422][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:17:46,536][__main__][INFO] - Number of regex retries in iteration 684: 0 [2026-03-26 01:17:46,537][__main__][INFO] - agents played in iteration 684 are Bob, Alice [2026-03-26 01:17:47,087][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:17:47,153][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:17:47,154][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:17:47,154][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:17:47,839][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:17:48,486][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:17:49,203][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:17:49,918][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:17:50,634][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:17:51,350][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:17:52,065][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:17:52,781][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:17:53,497][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:17:54,214][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:17:54,929][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:17:55,647][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:17:56,363][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:17:57,080][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:17:57,797][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:17:58,512][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:17:59,230][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:17:59,946][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:18:00,663][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:18:01,379][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:18:02,099][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:18:02,816][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:18:03,534][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:18:04,251][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:18:04,969][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:18:05,686][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:18:06,403][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:18:07,119][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:18:07,836][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:18:08,551][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:18:09,269][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:18:09,986][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:18:10,703][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:18:11,419][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:18:12,136][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:18:12,852][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:18:13,569][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:18:14,285][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:18:15,002][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:18:15,718][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:18:16,435][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:18:17,152][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:18:17,868][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:18:18,586][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:18:19,302][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:18:20,021][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:18:20,738][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:18:21,455][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:18:22,402][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:18:23,119][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:18:23,837][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:18:24,553][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:18:25,271][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:18:25,988][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:18:26,705][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:18:27,421][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:18:28,139][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:18:28,857][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:18:29,574][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:18:30,292][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:18:31,008][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:18:31,726][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:18:32,444][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:18:33,163][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:18:33,879][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:18:34,620][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:18:35,918][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:18:35,922][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:18:35,924][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:18:37,373][__main__][INFO] - Iteration 685 took 55s (9.14% Gen, 88.26% Train). Generation: 5s, Training: 49s. Estimated remaining time: 4h 38m 32s. Estimated total time: 15h 32m 35s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 15s, 500 more iterations: 7h 46m 17s. [2026-03-26 01:18:37,375][__main__][INFO] - Starting iteration 685. [2026-03-26 01:18:37,381][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 01:18:37,382][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:18:42,366][__main__][INFO] - Number of regex retries in iteration 685: 0 [2026-03-26 01:18:42,367][__main__][INFO] - agents played in iteration 685 are Bob, Alice [2026-03-26 01:18:43,172][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:18:43,239][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:18:43,240][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:18:43,240][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:18:43,983][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:18:44,631][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:18:45,348][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:18:46,064][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:18:46,780][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:18:47,495][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:18:48,213][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:18:48,927][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:18:49,646][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:18:50,360][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:18:51,077][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:18:51,793][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:18:52,509][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:18:53,225][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:18:53,942][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:18:54,658][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:18:55,375][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:18:56,092][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:18:56,808][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:18:57,526][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:18:58,243][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:18:58,960][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:18:59,677][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:19:00,396][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:19:01,111][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:19:01,829][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:19:02,545][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:19:03,264][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:19:03,980][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:19:04,696][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:19:05,412][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:19:06,129][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:19:06,844][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:19:07,562][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:19:08,278][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:19:08,995][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:19:09,712][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:19:10,428][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:19:11,144][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:19:11,861][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:19:12,577][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:19:13,293][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:19:14,010][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:19:14,727][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:19:15,443][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:19:16,161][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:19:16,880][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:19:17,595][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:19:18,594][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:19:19,313][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:19:20,029][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:19:20,747][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:19:21,463][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:19:22,179][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:19:22,897][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:19:23,614][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:19:24,330][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:19:25,048][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:19:25,765][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:19:26,483][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:19:27,200][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:19:27,918][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:19:28,635][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:19:29,353][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:19:30,069][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:19:30,803][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:19:31,818][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:19:31,821][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:19:31,823][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:19:33,262][__main__][INFO] - Iteration 686 took 55s (8.92% Gen, 88.50% Train). Generation: 4s, Training: 49s. Estimated remaining time: 4h 36m 25s. Estimated total time: 15h 31m 24s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 8s, 500 more iterations: 7h 45m 42s. [2026-03-26 01:19:33,267][__main__][INFO] - Starting iteration 686. [2026-03-26 01:19:33,272][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 01:19:33,273][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:19:41,340][__main__][INFO] - Number of regex retries in iteration 686: 0 [2026-03-26 01:19:41,342][__main__][INFO] - agents played in iteration 686 are Bob, Alice [2026-03-26 01:19:41,843][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:19:41,909][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:19:41,911][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:19:41,912][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:19:42,608][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:19:43,252][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:19:43,967][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:19:44,681][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:19:45,396][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:19:46,110][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:19:46,825][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:19:47,540][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:19:48,256][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:19:48,972][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:19:49,687][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:19:50,403][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:19:51,120][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:19:51,835][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:19:52,551][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:19:53,266][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:19:53,983][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:19:54,698][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:19:55,413][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:19:56,128][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:19:56,844][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:19:57,561][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:19:58,361][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:19:59,078][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:19:59,793][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:20:00,509][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:20:01,225][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:20:01,941][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:20:02,658][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:20:03,374][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:20:04,091][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:20:04,806][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:20:05,525][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:20:06,241][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:20:06,959][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:20:07,675][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:20:08,393][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:20:09,111][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:20:09,828][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:20:10,545][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:20:11,262][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:20:11,980][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:20:12,695][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:20:13,413][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:20:14,128][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:20:14,844][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:20:15,560][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:20:16,277][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:20:17,218][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:20:17,936][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:20:18,650][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:20:19,368][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:20:20,084][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:20:20,801][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:20:21,517][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:20:22,235][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:20:22,951][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:20:23,671][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:20:24,388][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:20:25,104][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:20:25,821][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:20:26,538][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:20:27,254][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:20:27,972][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:20:28,689][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:20:29,434][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:20:30,431][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:20:30,433][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:20:30,435][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:20:31,840][__main__][INFO] - Iteration 687 took 58s (13.78% Gen, 83.82% Train). Generation: 8s, Training: 49s. Estimated remaining time: 5h 20m 11s. Estimated total time: 16h 16m 9s. Time estimates for 10 more iterations: 9m 45s, 100 more iterations: 1h 37m 36s, 500 more iterations: 8h 8m 4s. [2026-03-26 01:20:31,843][__main__][INFO] - Starting iteration 687. [2026-03-26 01:20:31,852][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 01:20:31,853][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:20:36,875][__main__][INFO] - Number of regex retries in iteration 687: 0 [2026-03-26 01:20:36,876][__main__][INFO] - agents played in iteration 687 are Bob, Alice [2026-03-26 01:20:37,380][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:20:37,445][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:20:37,446][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:20:37,447][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:20:38,135][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:20:38,781][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:20:39,497][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:20:40,212][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:20:40,926][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:20:41,641][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:20:42,357][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:20:43,071][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:20:43,787][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:20:44,502][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:20:45,220][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:20:45,934][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:20:46,649][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:20:47,366][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:20:48,081][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:20:48,799][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:20:49,514][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:20:50,232][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:20:50,948][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:20:51,665][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:20:52,381][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:20:53,098][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:20:53,815][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:20:54,531][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:20:55,249][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:20:55,966][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:20:56,684][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:20:57,399][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:20:58,119][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:20:58,834][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:20:59,552][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:21:00,267][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:21:00,983][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:21:01,699][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:21:02,415][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:21:03,131][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:21:03,846][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:21:04,565][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:21:05,281][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:21:05,998][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:21:06,715][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:21:07,432][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:21:08,149][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:21:08,866][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:21:09,582][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:21:10,299][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:21:11,015][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:21:11,732][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:21:12,685][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:21:13,402][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:21:14,119][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:21:14,834][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:21:15,551][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:21:16,268][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:21:16,984][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:21:17,701][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:21:18,418][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:21:19,136][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:21:19,855][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:21:20,571][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:21:21,289][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:21:22,005][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:21:22,723][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:21:23,439][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:21:24,158][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:21:24,893][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:21:25,865][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:21:25,867][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:21:25,868][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:21:27,432][__main__][INFO] - Iteration 688 took 55s (9.04% Gen, 88.14% Train). Generation: 5s, Training: 48s. Estimated remaining time: 4h 29m 29s. Estimated total time: 15h 26m 22s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 38s, 500 more iterations: 7h 43m 11s. [2026-03-26 01:21:27,434][__main__][INFO] - Starting iteration 688. [2026-03-26 01:21:27,438][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 01:21:27,439][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:21:32,477][__main__][INFO] - Number of regex retries in iteration 688: 0 [2026-03-26 01:21:32,479][__main__][INFO] - agents played in iteration 688 are Bob, Alice [2026-03-26 01:21:32,981][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:21:33,047][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:21:33,048][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:21:33,049][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:21:33,794][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:21:34,440][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:21:35,158][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:21:35,874][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:21:36,588][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:21:37,307][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:21:38,022][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:21:38,739][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:21:39,454][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:21:40,172][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:21:40,886][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:21:41,605][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:21:42,319][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:21:43,037][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:21:43,752][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:21:44,471][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:21:45,186][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:21:45,903][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:21:46,619][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:21:47,337][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:21:48,053][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:21:48,771][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:21:49,488][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:21:50,205][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:21:50,922][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:21:51,638][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:21:52,354][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:21:53,071][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:21:53,787][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:21:54,503][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:21:55,219][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:21:55,935][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:21:56,652][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:21:57,368][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:21:58,086][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:21:58,801][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:21:59,518][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:22:00,235][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:22:00,952][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:22:01,670][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:22:02,387][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:22:03,106][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:22:03,823][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:22:04,543][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:22:05,262][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:22:05,980][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:22:06,698][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:22:07,418][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:22:08,409][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:22:09,130][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:22:09,846][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:22:10,565][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:22:11,285][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:22:12,004][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:22:12,725][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:22:13,442][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:22:14,162][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:22:14,880][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:22:15,599][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:22:16,319][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:22:17,038][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:22:17,755][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:22:18,476][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:22:19,193][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:22:19,912][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:22:20,649][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:22:21,659][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:22:21,663][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:22:21,664][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:22:23,022][__main__][INFO] - Iteration 689 took 55s (9.07% Gen, 88.48% Train). Generation: 5s, Training: 49s. Estimated remaining time: 4h 28m 37s. Estimated total time: 15h 26m 25s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 38s, 500 more iterations: 7h 43m 12s. [2026-03-26 01:22:23,025][__main__][INFO] - Starting iteration 689. [2026-03-26 01:22:23,030][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 01:22:23,031][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:22:28,018][__main__][INFO] - Number of regex retries in iteration 689: 0 [2026-03-26 01:22:28,019][__main__][INFO] - agents played in iteration 689 are Bob, Alice [2026-03-26 01:22:28,517][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:22:28,581][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:22:28,582][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:22:28,582][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:22:29,269][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:22:29,913][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:22:30,631][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:22:31,348][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:22:32,062][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:22:32,779][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:22:33,495][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:22:34,212][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:22:34,928][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:22:35,644][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:22:36,359][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:22:37,078][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:22:37,792][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:22:38,511][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:22:39,227][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:22:39,944][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:22:40,661][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:22:41,378][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:22:42,094][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:22:42,810][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:22:43,527][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:22:44,244][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:22:44,965][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:22:45,685][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:22:46,406][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:22:47,127][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:22:47,845][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:22:48,562][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:22:49,277][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:22:49,994][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:22:50,709][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:22:51,426][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:22:52,142][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:22:52,859][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:22:53,575][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:22:54,292][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:22:55,009][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:22:55,728][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:22:56,445][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:22:57,163][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:22:57,881][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:22:58,599][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:22:59,318][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:23:00,035][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:23:00,753][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:23:01,472][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:23:02,191][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:23:02,908][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:23:03,869][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:23:04,587][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:23:05,305][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:23:06,024][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:23:06,744][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:23:07,461][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:23:08,180][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:23:08,898][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:23:09,616][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:23:10,333][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:23:11,051][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:23:11,770][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:23:12,491][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:23:13,209][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:23:13,926][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:23:14,645][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:23:15,362][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:23:16,120][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:23:17,236][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:23:17,239][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:23:17,240][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:23:18,611][__main__][INFO] - Iteration 690 took 55s (8.97% Gen, 88.55% Train). Generation: 4s, Training: 49s. Estimated remaining time: 4h 27m 39s. Estimated total time: 15h 26m 24s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 38s, 500 more iterations: 7h 43m 12s. [2026-03-26 01:23:18,614][__main__][INFO] - Starting iteration 690. [2026-03-26 01:23:18,619][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 01:23:18,619][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:23:25,030][__main__][INFO] - Number of regex retries in iteration 690: 0 [2026-03-26 01:23:25,031][__main__][INFO] - agents played in iteration 690 are Bob, Alice [2026-03-26 01:23:25,541][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:23:25,605][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:23:25,606][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:23:25,607][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:23:26,316][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:23:26,963][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:23:27,681][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:23:28,398][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:23:29,114][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:23:29,833][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:23:30,550][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:23:31,266][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:23:31,985][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:23:32,701][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:23:33,420][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:23:34,136][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:23:34,853][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:23:35,570][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:23:36,287][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:23:37,003][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:23:37,720][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:23:38,438][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:23:39,154][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:23:39,870][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:23:40,586][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:23:41,301][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:23:42,017][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:23:42,733][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:23:43,451][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:23:44,166][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:23:44,884][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:23:45,601][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:23:46,316][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:23:47,034][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:23:47,751][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:23:48,467][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:23:49,182][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:23:49,899][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:23:50,614][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:23:51,330][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:23:52,046][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:23:52,762][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:23:53,478][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:23:54,194][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:23:54,911][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:23:55,627][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:23:56,344][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:23:57,061][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:23:57,778][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:23:58,494][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:23:59,210][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:23:59,926][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:24:00,870][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:24:01,589][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:24:02,305][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:24:03,023][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:24:03,738][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:24:04,456][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:24:05,173][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:24:05,890][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:24:06,607][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:24:07,326][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:24:08,042][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:24:08,759][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:24:09,478][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:24:10,196][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:24:10,911][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:24:11,630][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:24:12,346][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:24:13,094][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:24:14,142][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:24:14,146][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:24:14,149][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:24:15,481][__main__][INFO] - Iteration 691 took 56s (11.28% Gen, 86.38% Train). Generation: 6s, Training: 49s. Estimated remaining time: 4h 48m 4s. Estimated total time: 15h 47m 45s. Time estimates for 10 more iterations: 9m 28s, 100 more iterations: 1h 34m 46s, 500 more iterations: 7h 53m 52s. [2026-03-26 01:24:15,484][__main__][INFO] - Starting iteration 691. [2026-03-26 01:24:15,488][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 01:24:15,489][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:24:20,490][__main__][INFO] - Number of regex retries in iteration 691: 0 [2026-03-26 01:24:20,491][__main__][INFO] - agents played in iteration 691 are Bob, Alice [2026-03-26 01:24:21,078][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:24:21,143][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:24:21,144][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:24:21,145][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:24:21,875][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:24:22,521][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:24:23,238][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:24:23,953][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:24:24,668][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:24:25,383][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:24:26,097][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:24:26,815][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:24:27,531][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:24:28,248][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:24:28,962][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:24:29,678][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:24:30,394][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:24:31,111][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:24:31,827][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:24:32,543][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:24:33,260][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:24:33,976][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:24:34,692][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:24:35,409][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:24:36,125][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:24:36,843][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:24:37,558][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:24:38,275][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:24:38,993][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:24:39,710][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:24:40,425][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:24:41,144][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:24:41,861][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:24:42,579][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:24:43,296][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:24:44,011][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:24:44,731][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:24:45,449][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:24:46,166][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:24:46,881][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:24:47,597][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:24:48,312][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:24:49,030][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:24:49,747][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:24:50,463][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:24:51,179][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:24:51,896][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:24:52,612][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:24:53,329][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:24:54,045][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:24:54,761][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:24:55,477][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:24:56,470][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:24:57,186][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:24:57,905][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:24:58,620][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:24:59,339][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:25:00,055][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:25:00,773][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:25:01,490][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:25:02,207][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:25:02,923][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:25:03,641][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:25:04,357][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:25:05,073][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:25:05,790][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:25:06,507][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:25:07,225][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:25:07,941][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:25:08,691][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:25:09,678][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:25:09,682][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:25:09,683][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:25:11,019][__main__][INFO] - Iteration 692 took 55s (9.01% Gen, 88.58% Train). Generation: 5s, Training: 49s. Estimated remaining time: 4h 24m 55s. Estimated total time: 15h 25m 32s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 33s, 500 more iterations: 7h 42m 46s. [2026-03-26 01:25:11,023][__main__][INFO] - Starting iteration 692. [2026-03-26 01:25:11,029][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 01:25:11,030][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:25:15,948][__main__][INFO] - Number of regex retries in iteration 692: 0 [2026-03-26 01:25:15,949][__main__][INFO] - agents played in iteration 692 are Bob, Alice [2026-03-26 01:25:16,493][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:25:16,559][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:25:16,560][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:25:16,561][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:25:17,240][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:25:17,884][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:25:18,602][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:25:19,320][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:25:20,034][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:25:20,751][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:25:21,466][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:25:22,182][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:25:22,899][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:25:23,613][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:25:24,328][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:25:25,045][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:25:25,759][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:25:26,478][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:25:27,192][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:25:27,909][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:25:28,625][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:25:29,341][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:25:30,058][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:25:30,773][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:25:31,491][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:25:32,207][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:25:32,925][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:25:33,641][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:25:34,358][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:25:35,072][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:25:35,789][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:25:36,505][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:25:37,223][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:25:37,939][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:25:38,657][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:25:39,376][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:25:40,093][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:25:40,811][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:25:41,528][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:25:42,246][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:25:42,962][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:25:43,678][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:25:44,395][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:25:45,113][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:25:45,828][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:25:46,546][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:25:47,261][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:25:47,978][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:25:48,694][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:25:49,411][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:25:50,127][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:25:50,844][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:25:51,788][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:25:52,505][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:25:53,222][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:25:53,939][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:25:54,655][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:25:55,373][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:25:56,089][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:25:56,807][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:25:57,524][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:25:58,241][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:25:58,959][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:25:59,676][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:26:00,393][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:26:01,110][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:26:01,826][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:26:02,543][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:26:03,262][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:26:04,003][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:26:04,999][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:26:05,001][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:26:05,003][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:26:06,428][__main__][INFO] - Iteration 693 took 55s (8.88% Gen, 88.54% Train). Generation: 4s, Training: 49s. Estimated remaining time: 4h 21m 49s. Estimated total time: 15h 23m 21s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 20s, 500 more iterations: 7h 41m 40s. [2026-03-26 01:26:06,431][__main__][INFO] - Starting iteration 693. [2026-03-26 01:26:06,436][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 01:26:06,437][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:26:11,424][__main__][INFO] - Number of regex retries in iteration 693: 0 [2026-03-26 01:26:11,426][__main__][INFO] - agents played in iteration 693 are Bob, Alice [2026-03-26 01:26:11,929][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:26:11,995][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:26:11,997][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:26:11,997][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:26:12,678][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:26:13,323][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:26:14,040][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:26:14,755][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:26:15,471][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:26:16,186][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:26:16,901][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:26:17,619][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:26:18,333][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:26:19,050][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:26:19,766][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:26:20,481][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:26:21,197][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:26:21,912][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:26:22,630][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:26:23,345][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:26:24,062][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:26:24,778][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:26:25,496][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:26:26,211][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:26:26,930][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:26:27,645][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:26:28,362][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:26:29,079][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:26:29,795][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:26:30,514][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:26:31,229][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:26:31,950][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:26:32,666][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:26:33,384][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:26:34,100][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:26:34,817][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:26:35,534][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:26:36,252][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:26:36,967][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:26:37,684][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:26:38,401][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:26:39,117][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:26:39,835][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:26:40,552][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:26:41,268][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:26:41,985][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:26:42,701][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:26:43,420][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:26:44,134][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:26:44,852][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:26:45,568][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:26:46,285][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:26:47,252][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:26:47,969][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:26:48,686][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:26:49,402][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:26:50,121][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:26:50,838][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:26:51,554][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:26:52,270][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:26:52,987][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:26:53,706][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:26:54,425][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:26:55,142][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:26:55,858][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:26:56,577][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:26:57,294][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:26:58,010][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:26:58,727][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:26:59,494][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:27:00,594][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:27:00,598][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:27:00,600][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:27:01,960][__main__][INFO] - Iteration 694 took 55s (8.98% Gen, 88.56% Train). Generation: 4s, Training: 49s. Estimated remaining time: 4h 22m 58s. Estimated total time: 15h 25m 26s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 32s, 500 more iterations: 7h 42m 43s. [2026-03-26 01:27:01,963][__main__][INFO] - Starting iteration 694. [2026-03-26 01:27:01,967][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 01:27:01,968][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:27:06,869][__main__][INFO] - Number of regex retries in iteration 694: 0 [2026-03-26 01:27:06,871][__main__][INFO] - agents played in iteration 694 are Bob, Alice [2026-03-26 01:27:07,371][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:27:07,435][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:27:07,436][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:27:07,437][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:27:08,148][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:27:08,794][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:27:09,513][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:27:10,226][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:27:10,944][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:27:11,659][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:27:12,375][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:27:13,090][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:27:13,809][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:27:14,524][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:27:15,242][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:27:15,958][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:27:16,675][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:27:17,389][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:27:18,109][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:27:18,825][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:27:19,542][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:27:20,257][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:27:20,974][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:27:21,690][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:27:22,407][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:27:23,123][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:27:23,841][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:27:24,557][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:27:25,275][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:27:25,992][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:27:26,710][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:27:27,427][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:27:28,144][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:27:28,861][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:27:29,578][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:27:30,295][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:27:31,011][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:27:31,728][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:27:32,444][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:27:33,161][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:27:33,877][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:27:34,594][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:27:35,311][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:27:36,028][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:27:36,744][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:27:37,463][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:27:38,180][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:27:38,897][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:27:39,615][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:27:40,332][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:27:41,048][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:27:41,766][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:27:42,737][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:27:43,455][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:27:44,173][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:27:44,889][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:27:45,605][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:27:46,322][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:27:47,038][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:27:47,756][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:27:48,473][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:27:49,192][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:27:49,909][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:27:50,629][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:27:51,348][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:27:52,067][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:27:52,785][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:27:53,503][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:27:54,222][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:27:54,945][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:27:56,120][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:27:56,123][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:27:56,125][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:27:57,392][__main__][INFO] - Iteration 695 took 55s (8.85% Gen, 88.86% Train). Generation: 4s, Training: 49s. Estimated remaining time: 4h 20m 24s. Estimated total time: 15h 23m 47s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 22s, 500 more iterations: 7h 41m 53s. [2026-03-26 01:27:57,395][__main__][INFO] - Starting iteration 695. [2026-03-26 01:27:57,400][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 01:27:57,400][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:28:02,589][__main__][INFO] - Number of regex retries in iteration 695: 0 [2026-03-26 01:28:02,591][__main__][INFO] - agents played in iteration 695 are Bob, Alice [2026-03-26 01:28:03,296][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:28:03,362][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:28:03,363][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:28:03,363][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:28:04,051][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:28:04,697][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:28:05,415][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:28:06,130][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:28:06,846][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:28:07,561][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:28:08,277][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:28:08,993][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:28:09,709][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:28:10,424][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:28:11,140][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:28:11,856][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:28:12,571][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:28:13,289][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:28:14,005][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:28:14,999][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:28:15,778][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:28:16,496][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:28:17,211][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:28:17,929][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:28:18,645][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:28:19,360][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:28:20,077][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:28:20,793][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:28:21,509][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:28:22,226][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:28:22,944][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:28:23,659][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:28:24,376][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:28:25,093][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:28:25,810][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:28:26,527][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:28:27,244][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:28:27,962][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:28:28,681][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:28:29,398][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:28:30,116][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:28:30,833][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:28:31,552][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:28:32,268][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:28:32,985][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:28:33,701][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:28:34,418][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:28:35,134][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:28:35,852][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:28:36,568][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:28:37,288][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:28:38,003][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:28:38,950][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:28:39,672][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:28:40,389][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:28:41,105][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:28:41,821][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:28:42,539][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:28:43,256][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:28:43,974][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:28:44,690][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:28:45,407][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:28:46,125][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:28:46,842][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:28:47,559][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:28:48,275][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:28:48,995][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:28:49,712][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:28:50,430][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:28:51,158][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-26 01:28:52,308][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:28:52,313][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:28:52,315][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:28:58,581][__main__][INFO] - Iteration 696 took 1m 1s (8.48% Gen, 81.27% Train). Generation: 5s, Training: 49s. Estimated remaining time: 5h 55m 19s. Estimated total time: 16h 59m 44s. Time estimates for 10 more iterations: 10m 11s, 100 more iterations: 1h 41m 58s, 500 more iterations: 8h 29m 52s. [2026-03-26 01:28:58,586][__main__][INFO] - Starting iteration 696. [2026-03-26 01:28:58,591][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 01:28:58,593][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:29:03,549][__main__][INFO] - Number of regex retries in iteration 696: 0 [2026-03-26 01:29:03,550][__main__][INFO] - agents played in iteration 696 are Bob, Alice [2026-03-26 01:29:04,053][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:29:04,120][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:29:04,121][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:29:04,121][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:29:04,805][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:29:05,448][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:29:06,166][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:29:06,879][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:29:07,592][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:29:08,305][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:29:09,019][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:29:09,735][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:29:10,449][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:29:11,162][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:29:11,877][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:29:12,590][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:29:13,304][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:29:14,018][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:29:14,731][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:29:15,447][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:29:16,161][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:29:16,876][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:29:17,592][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:29:18,307][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:29:19,023][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:29:19,738][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:29:20,453][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:29:21,169][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:29:21,885][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:29:22,600][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:29:23,315][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:29:24,031][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:29:24,747][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:29:25,463][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:29:26,178][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:29:26,895][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:29:27,611][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:29:28,327][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:29:29,042][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:29:29,758][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:29:30,474][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:29:31,191][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:29:31,907][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:29:32,623][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:29:33,339][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:29:34,056][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:29:34,772][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:29:35,489][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:29:36,206][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:29:36,923][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:29:37,640][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:29:38,358][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:29:39,315][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:29:40,034][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:29:40,751][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:29:41,468][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:29:42,184][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:29:42,902][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:29:43,618][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:29:44,336][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:29:45,053][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:29:45,772][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:29:46,488][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:29:47,204][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:29:47,923][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:29:48,641][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:29:49,358][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:29:50,074][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:29:50,789][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:29:51,562][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:29:52,553][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:29:52,555][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:29:52,557][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:29:53,882][__main__][INFO] - Iteration 697 took 55s (8.96% Gen, 88.63% Train). Generation: 4s, Training: 49s. Estimated remaining time: 4h 16m 13s. Estimated total time: 15h 21m 32s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 9s, 500 more iterations: 7h 40m 46s. [2026-03-26 01:29:53,884][__main__][INFO] - Starting iteration 697. [2026-03-26 01:29:53,889][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 01:29:53,889][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:29:58,747][__main__][INFO] - Number of regex retries in iteration 697: 0 [2026-03-26 01:29:58,749][__main__][INFO] - agents played in iteration 697 are Bob, Alice [2026-03-26 01:29:59,285][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:29:59,350][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:29:59,351][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:29:59,352][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:30:00,060][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:30:00,703][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:30:01,421][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:30:02,135][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:30:02,850][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:30:03,568][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:30:04,283][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:30:04,997][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:30:05,712][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:30:06,429][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:30:07,144][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:30:07,860][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:30:08,576][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:30:09,294][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:30:10,009][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:30:10,726][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:30:11,441][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:30:12,158][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:30:12,873][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:30:13,590][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:30:14,306][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:30:15,022][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:30:15,738][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:30:16,455][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:30:17,170][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:30:17,887][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:30:18,604][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:30:19,319][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:30:20,037][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:30:20,753][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:30:21,469][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:30:22,188][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:30:22,904][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:30:23,624][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:30:24,339][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:30:25,056][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:30:25,772][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:30:26,488][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:30:27,205][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:30:27,922][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:30:28,638][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:30:29,356][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:30:30,071][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:30:30,789][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:30:31,505][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:30:32,221][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:30:32,938][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:30:33,655][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:30:34,639][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:30:35,356][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:30:36,073][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:30:36,788][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:30:37,507][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:30:38,222][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:30:38,940][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:30:39,656][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:30:40,373][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:30:41,089][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:30:41,806][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:30:42,523][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:30:43,241][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:30:43,958][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:30:44,676][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:30:45,391][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:30:46,110][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:30:46,830][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:30:48,040][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:30:48,043][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:30:48,044][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:30:49,401][__main__][INFO] - Iteration 698 took 55s (8.75% Gen, 88.80% Train). Generation: 4s, Training: 49s. Estimated remaining time: 4h 18m 59s. Estimated total time: 15h 25m 14s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 31s, 500 more iterations: 7h 42m 37s. [2026-03-26 01:30:49,404][__main__][INFO] - Starting iteration 698. [2026-03-26 01:30:49,413][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 01:30:49,414][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:30:54,823][__main__][INFO] - Number of regex retries in iteration 698: 0 [2026-03-26 01:30:54,824][__main__][INFO] - agents played in iteration 698 are Bob, Alice [2026-03-26 01:30:55,394][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:30:55,459][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:30:55,460][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:30:55,461][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:30:56,147][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:30:56,792][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:30:57,508][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:30:58,225][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:30:58,940][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:30:59,654][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:31:00,372][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:31:01,088][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:31:01,803][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:31:02,520][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:31:03,236][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:31:03,951][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:31:04,668][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:31:05,382][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:31:06,102][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:31:06,817][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:31:07,534][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:31:08,250][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:31:08,967][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:31:09,683][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:31:10,400][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:31:11,116][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:31:11,833][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:31:12,551][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:31:13,268][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:31:13,986][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:31:14,701][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:31:15,420][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:31:16,136][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:31:16,853][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:31:17,568][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:31:18,285][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:31:19,002][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:31:19,718][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:31:20,434][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:31:21,151][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:31:21,867][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:31:22,583][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:31:23,299][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:31:24,015][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:31:24,732][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:31:25,448][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:31:26,164][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:31:26,881][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:31:27,598][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:31:28,315][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:31:29,032][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:31:29,748][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:31:30,692][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:31:31,409][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:31:32,126][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:31:32,843][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:31:33,560][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:31:34,277][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:31:34,995][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:31:35,712][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:31:36,428][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:31:37,146][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:31:37,862][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:31:38,582][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:31:39,302][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:31:40,019][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:31:40,736][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:31:41,453][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:31:42,171][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:31:42,900][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:31:44,312][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:31:44,317][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:31:44,319][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:31:45,720][__main__][INFO] - Iteration 699 took 56s (9.61% Gen, 87.90% Train). Generation: 5s, Training: 49s. Estimated remaining time: 4h 31m 17s. Estimated total time: 15h 38m 29s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 50s, 500 more iterations: 7h 49m 14s. [2026-03-26 01:31:45,722][__main__][INFO] - Starting iteration 699. [2026-03-26 01:31:45,727][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 01:31:45,728][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:31:50,957][__main__][INFO] - Number of regex retries in iteration 699: 0 [2026-03-26 01:31:50,958][__main__][INFO] - agents played in iteration 699 are Bob, Alice [2026-03-26 01:31:51,503][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:31:51,569][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:31:51,570][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:31:51,570][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:31:52,249][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:31:52,896][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:31:53,614][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:31:54,328][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:31:55,043][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:31:55,759][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:31:56,475][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:31:57,189][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:31:57,905][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:31:58,620][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:31:59,337][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:32:00,052][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:32:00,770][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:32:01,487][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:32:02,204][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:32:02,919][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:32:03,635][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:32:04,352][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:32:05,068][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:32:05,786][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:32:06,502][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:32:07,220][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:32:07,935][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:32:08,654][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:32:09,371][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:32:10,088][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:32:10,805][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:32:11,521][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:32:12,238][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:32:12,955][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:32:13,673][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:32:14,388][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:32:15,106][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:32:15,821][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:32:16,538][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:32:17,254][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:32:17,972][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:32:18,687][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:32:19,404][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:32:20,120][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:32:20,837][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:32:21,554][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:32:22,271][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:32:22,986][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:32:23,704][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:32:24,421][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:32:25,137][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:32:25,852][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:32:26,799][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:32:27,517][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:32:28,233][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:32:28,950][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:32:29,666][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:32:30,383][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:32:31,099][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:32:31,817][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:32:32,533][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:32:33,250][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:32:33,967][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:32:34,683][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:32:35,401][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:32:36,116][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:32:36,835][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:32:37,550][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:32:38,269][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:32:39,011][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:32:40,026][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:32:40,029][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:32:40,030][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:32:41,364][__main__][INFO] - Iteration 700 took 55s (9.40% Gen, 88.20% Train). Generation: 5s, Training: 49s. Estimated remaining time: 4h 19m 12s. Estimated total time: 15h 27m 19s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 43s, 500 more iterations: 7h 43m 39s. [2026-03-26 01:32:41,368][__main__][INFO] - Starting iteration 700. [2026-03-26 01:32:41,373][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2026-03-26 01:32:41,373][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:32:46,858][__main__][INFO] - Number of regex retries in iteration 700: 0 [2026-03-26 01:32:46,859][__main__][INFO] - agents played in iteration 700 are Bob, Alice [2026-03-26 01:32:47,371][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:32:47,436][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:32:47,439][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:32:47,439][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:32:48,166][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:32:48,813][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:32:49,529][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:32:50,245][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:32:50,959][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:32:51,676][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:32:52,390][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:32:53,106][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:32:53,821][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:32:54,537][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:32:55,253][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:32:55,968][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:32:56,683][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:32:57,400][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:32:58,117][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:32:58,832][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:32:59,550][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:33:00,265][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:33:00,982][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:33:01,698][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:33:02,413][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:33:03,130][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:33:03,846][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:33:04,562][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:33:05,278][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:33:05,997][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:33:06,712][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:33:07,430][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:33:08,147][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:33:08,866][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:33:09,582][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:33:10,299][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:33:11,016][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:33:11,737][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:33:12,452][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:33:13,169][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:33:13,886][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:33:14,602][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:33:15,319][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:33:16,035][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:33:16,754][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:33:17,470][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:33:18,188][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:33:18,903][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:33:19,619][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:33:20,336][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:33:21,053][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:33:21,769][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:33:22,764][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:33:23,480][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:33:24,196][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:33:24,914][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:33:25,629][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:33:26,346][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:33:27,062][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:33:27,779][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:33:28,496][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:33:29,214][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:33:29,931][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:33:30,648][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:33:31,366][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:33:32,082][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:33:32,800][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:33:33,518][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:33:34,235][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:33:34,959][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:33:36,166][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:33:36,170][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:33:36,172][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:33:38,784][__main__][INFO] - Iteration 701 took 57s (9.55% Gen, 85.89% Train). Generation: 5s, Training: 49s. Estimated remaining time: 4h 47m 48s. Estimated total time: 15h 56m 53s. Time estimates for 10 more iterations: 9m 34s, 100 more iterations: 1h 35m 41s, 500 more iterations: 7h 58m 26s. [2026-03-26 01:33:38,787][__main__][INFO] - Starting iteration 701. [2026-03-26 01:33:38,790][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:33:38,791][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:33:49,708][__main__][INFO] - Number of regex retries in iteration 701: 0 [2026-03-26 01:33:49,710][__main__][INFO] - agents played in iteration 701 are Bob, Alice [2026-03-26 01:33:50,213][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:33:50,279][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:33:50,279][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:33:50,280][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:33:50,961][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:33:51,604][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:33:52,316][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:33:53,027][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:33:53,740][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:33:54,452][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:33:55,163][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:33:55,876][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:33:56,589][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:33:57,302][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:33:58,016][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:33:58,729][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:33:59,440][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:34:00,154][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:34:00,868][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:34:01,580][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:34:02,294][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:34:03,007][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:34:03,721][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:34:04,436][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:34:05,150][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:34:05,866][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:34:06,579][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:34:07,293][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:34:08,008][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:34:08,722][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:34:09,436][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:34:10,153][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:34:10,866][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:34:11,580][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:34:12,296][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:34:13,009][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:34:13,725][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:34:14,440][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:34:15,155][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:34:15,874][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:34:16,589][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:34:17,503][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:34:18,220][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:34:18,934][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:34:19,651][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:34:20,365][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:34:21,081][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:34:21,797][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:34:22,513][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:34:23,228][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:34:23,944][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:34:24,660][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:34:25,600][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:34:26,315][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:34:27,031][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:34:27,747][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:34:28,464][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:34:29,180][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:34:29,896][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:34:30,613][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:34:31,329][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:34:32,045][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:34:32,762][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:34:33,478][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:34:34,196][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:34:34,913][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:34:35,629][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:34:36,345][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:34:37,065][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:34:37,793][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:34:38,917][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:34:38,920][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:34:38,922][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:34:40,310][__main__][INFO] - Iteration 702 took 1m 1s (17.75% Gen, 79.99% Train). Generation: 10s, Training: 49s. Estimated remaining time: 5h 55m 15s. Estimated total time: 17h 5m 21s. Time estimates for 10 more iterations: 10m 15s, 100 more iterations: 1h 42m 32s, 500 more iterations: 8h 32m 40s. [2026-03-26 01:34:40,313][__main__][INFO] - Starting iteration 702. [2026-03-26 01:34:40,317][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:34:40,318][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:34:45,239][__main__][INFO] - Number of regex retries in iteration 702: 0 [2026-03-26 01:34:45,241][__main__][INFO] - agents played in iteration 702 are Bob, Alice [2026-03-26 01:34:45,741][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:34:45,805][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:34:45,806][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:34:45,807][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:34:46,496][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:34:47,141][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:34:47,858][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:34:48,572][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:34:49,284][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:34:50,000][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:34:50,715][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:34:51,429][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:34:52,144][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:34:52,858][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:34:53,575][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:34:54,288][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:34:55,004][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:34:55,719][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:34:56,434][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:34:57,149][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:34:57,863][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:34:58,579][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:34:59,295][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:35:00,011][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:35:00,726][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:35:01,442][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:35:02,157][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:35:02,876][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:35:03,590][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:35:04,309][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:35:05,023][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:35:05,742][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:35:06,457][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:35:07,175][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:35:07,891][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:35:08,608][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:35:09,326][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:35:10,044][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:35:10,759][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:35:11,475][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:35:12,191][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:35:12,909][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:35:13,624][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:35:14,343][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:35:15,060][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:35:15,777][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:35:16,494][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:35:17,212][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:35:17,927][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:35:18,647][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:35:19,362][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:35:20,080][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:35:21,028][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:35:21,744][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:35:22,462][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:35:23,179][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:35:23,896][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:35:24,613][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:35:25,330][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:35:26,046][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:35:26,764][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:35:27,480][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:35:28,196][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:35:28,912][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:35:29,629][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:35:30,347][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:35:31,062][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:35:31,780][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:35:32,495][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:35:33,235][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:35:34,373][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:35:34,376][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:35:34,377][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:35:35,672][__main__][INFO] - Iteration 703 took 55s (8.89% Gen, 88.76% Train). Generation: 4s, Training: 49s. Estimated remaining time: 4h 11m 35s. Estimated total time: 15h 22m 36s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 15s, 500 more iterations: 7h 41m 18s. [2026-03-26 01:35:35,674][__main__][INFO] - Starting iteration 703. [2026-03-26 01:35:35,679][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:35:35,680][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:35:41,339][__main__][INFO] - Number of regex retries in iteration 703: 0 [2026-03-26 01:35:41,340][__main__][INFO] - agents played in iteration 703 are Bob, Alice [2026-03-26 01:35:41,841][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:35:41,907][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:35:41,908][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:35:41,909][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:35:42,605][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:35:43,249][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:35:43,966][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:35:44,680][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:35:45,393][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:35:46,109][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:35:46,823][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:35:47,538][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:35:48,253][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:35:48,967][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:35:49,684][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:35:50,399][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:35:51,116][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:35:51,830][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:35:52,547][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:35:53,261][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:35:53,978][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:35:54,694][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:35:55,411][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:35:56,126][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:35:56,843][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:35:57,560][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:35:58,278][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:35:58,994][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:35:59,713][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:36:00,427][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:36:01,147][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:36:01,862][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:36:02,579][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:36:03,296][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:36:04,013][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:36:04,729][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:36:05,448][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:36:06,164][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:36:06,880][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:36:07,599][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:36:08,315][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:36:09,035][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:36:09,753][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:36:10,470][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:36:11,187][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:36:11,903][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:36:12,622][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:36:13,338][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:36:14,058][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:36:14,775][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:36:15,493][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:36:16,209][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:36:17,206][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:36:17,922][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:36:18,639][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:36:19,356][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:36:20,072][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:36:20,788][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:36:21,506][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:36:22,223][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:36:22,939][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:36:23,657][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:36:24,374][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:36:25,090][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:36:25,807][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:36:26,524][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:36:27,243][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:36:27,961][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:36:28,679][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:36:29,451][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:36:30,445][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:36:30,447][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:36:30,449][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:36:31,925][__main__][INFO] - Iteration 704 took 56s (10.06% Gen, 87.31% Train). Generation: 5s, Training: 49s. Estimated remaining time: 4h 25m 30s. Estimated total time: 15h 37m 28s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 44s, 500 more iterations: 7h 48m 44s. [2026-03-26 01:36:31,927][__main__][INFO] - Starting iteration 704. [2026-03-26 01:36:31,932][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:36:31,933][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:36:36,908][__main__][INFO] - Number of regex retries in iteration 704: 0 [2026-03-26 01:36:36,909][__main__][INFO] - agents played in iteration 704 are Bob, Alice [2026-03-26 01:36:37,409][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:36:37,473][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:36:37,474][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:36:37,474][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:36:38,179][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:36:38,827][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:36:39,544][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:36:40,260][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:36:40,977][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:36:41,692][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:36:42,407][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:36:43,121][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:36:43,838][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:36:44,552][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:36:45,268][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:36:45,985][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:36:46,701][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:36:47,418][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:36:48,134][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:36:48,850][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:36:49,567][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:36:50,283][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:36:51,000][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:36:51,717][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:36:52,433][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:36:53,150][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:36:53,866][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:36:54,582][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:36:55,299][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:36:56,016][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:36:56,732][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:36:57,449][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:36:58,166][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:36:58,883][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:36:59,599][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:37:00,316][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:37:01,033][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:37:01,750][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:37:02,468][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:37:03,185][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:37:03,904][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:37:04,620][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:37:05,339][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:37:06,056][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:37:06,774][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:37:07,492][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:37:08,209][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:37:08,926][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:37:09,641][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:37:10,360][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:37:11,077][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:37:11,794][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:37:12,744][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:37:13,460][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:37:14,176][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:37:14,892][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:37:15,609][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:37:16,327][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:37:17,044][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:37:17,761][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:37:18,478][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:37:19,195][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:37:19,913][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:37:20,631][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:37:21,347][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:37:22,064][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:37:22,780][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:37:23,498][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:37:24,215][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:37:24,941][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:37:25,960][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:37:25,964][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:37:25,965][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:37:27,339][__main__][INFO] - Iteration 705 took 55s (8.98% Gen, 88.53% Train). Generation: 4s, Training: 49s. Estimated remaining time: 4h 10m 36s. Estimated total time: 15h 23m 29s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 20s, 500 more iterations: 7h 41m 44s. [2026-03-26 01:37:27,342][__main__][INFO] - Starting iteration 705. [2026-03-26 01:37:27,346][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:37:27,347][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:37:32,360][__main__][INFO] - Number of regex retries in iteration 705: 0 [2026-03-26 01:37:32,361][__main__][INFO] - agents played in iteration 705 are Bob, Alice [2026-03-26 01:37:32,879][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:37:32,944][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:37:32,945][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:37:32,946][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:37:33,638][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:37:34,284][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:37:35,003][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:37:35,718][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:37:36,434][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:37:37,150][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:37:37,866][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:37:38,582][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:37:39,298][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:37:40,015][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:37:40,731][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:37:41,447][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:37:42,164][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:37:42,880][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:37:43,597][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:37:44,313][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:37:45,030][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:37:45,747][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:37:46,464][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:37:47,182][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:37:47,899][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:37:48,615][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:37:49,331][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:37:50,049][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:37:50,765][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:37:51,484][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:37:52,200][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:37:52,919][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:37:53,636][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:37:54,353][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:37:55,073][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:37:55,789][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:37:56,507][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:37:57,225][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:37:57,943][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:37:58,661][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:37:59,380][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:38:00,097][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:38:00,812][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:38:01,531][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:38:02,247][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:38:02,964][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:38:03,678][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:38:04,396][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:38:05,113][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:38:05,830][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:38:06,547][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:38:07,263][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:38:08,212][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:38:08,933][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:38:11,251][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:38:11,969][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:38:12,687][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:38:13,406][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:38:14,124][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:38:14,843][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:38:15,561][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:38:16,278][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:38:16,996][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:38:17,712][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:38:18,429][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:38:19,146][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:38:19,861][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:38:20,578][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:38:21,294][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:38:22,021][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:48 [2026-03-26 01:38:23,043][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:38:23,046][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:38:23,047][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:38:24,447][__main__][INFO] - Iteration 706 took 57s (8.78% Gen, 88.77% Train). Generation: 5s, Training: 50s. Estimated remaining time: 4h 37m 52s. Estimated total time: 15h 51m 42s. Time estimates for 10 more iterations: 9m 31s, 100 more iterations: 1h 35m 10s, 500 more iterations: 7h 55m 51s. [2026-03-26 01:38:24,450][__main__][INFO] - Starting iteration 706. [2026-03-26 01:38:24,454][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:38:24,454][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:38:29,425][__main__][INFO] - Number of regex retries in iteration 706: 0 [2026-03-26 01:38:29,426][__main__][INFO] - agents played in iteration 706 are Bob, Alice [2026-03-26 01:38:30,008][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:38:30,074][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:38:30,075][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:38:30,076][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:38:30,768][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:38:31,414][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:38:32,133][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:38:32,846][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:38:33,561][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:38:34,275][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:38:34,991][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:38:35,707][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:38:36,423][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:38:37,138][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:38:37,855][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:38:38,569][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:38:39,289][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:38:40,004][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:38:40,721][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:38:41,436][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:38:42,155][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:38:42,869][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:38:43,586][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:38:44,302][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:38:45,020][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:38:45,736][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:38:46,454][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:38:47,169][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:38:47,888][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:38:48,603][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:38:49,319][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:38:50,037][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:38:50,753][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:38:51,471][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:38:52,188][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:38:52,906][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:38:53,623][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:38:54,341][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:38:55,057][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:38:55,775][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:38:56,492][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:38:57,208][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:38:57,925][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:38:58,644][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:38:59,362][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:39:00,079][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:39:00,797][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:39:01,520][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:39:02,236][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:39:02,952][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:39:03,668][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:39:04,384][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:39:05,381][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:39:06,099][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:39:06,815][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:39:07,530][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:39:08,248][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:39:08,963][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:39:09,681][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:39:10,399][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:39:11,115][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:39:11,832][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:39:12,549][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:39:13,265][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:39:13,984][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:39:14,698][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:39:15,417][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:39:16,135][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:39:16,852][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:39:17,593][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:39:18,712][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:39:18,716][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:39:18,718][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:39:23,571][__main__][INFO] - Iteration 707 took 59s (8.41% Gen, 83.38% Train). Generation: 4s, Training: 49s. Estimated remaining time: 5h 10m 30s. Estimated total time: 16h 25m 19s. Time estimates for 10 more iterations: 9m 51s, 100 more iterations: 1h 38m 31s, 500 more iterations: 8h 12m 39s. [2026-03-26 01:39:23,574][__main__][INFO] - Starting iteration 707. [2026-03-26 01:39:23,577][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:39:23,578][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:39:28,505][__main__][INFO] - Number of regex retries in iteration 707: 0 [2026-03-26 01:39:28,506][__main__][INFO] - agents played in iteration 707 are Bob, Alice [2026-03-26 01:39:29,043][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:39:29,108][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:39:29,109][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:39:29,109][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:39:29,791][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:39:30,436][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:39:31,150][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:39:31,867][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:39:32,579][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:39:33,294][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:39:34,009][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:39:34,722][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:39:35,438][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:39:36,151][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:39:36,865][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:39:37,580][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:39:38,293][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:39:39,010][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:39:39,726][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:39:40,440][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:39:41,155][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:39:41,871][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:39:42,586][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:39:43,302][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:39:44,018][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:39:44,732][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:39:45,447][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:39:46,165][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:39:46,883][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:39:47,603][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:39:48,320][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:39:49,036][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:39:49,751][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:39:50,468][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:39:51,184][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:39:51,901][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:39:52,616][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:39:53,334][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:39:54,049][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:39:54,767][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:39:55,482][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:39:56,198][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:39:56,916][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:39:57,632][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:39:58,349][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:39:59,065][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:39:59,782][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:40:00,499][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:40:01,215][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:40:01,934][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:40:02,650][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:40:03,366][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:40:04,315][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:40:05,033][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:40:05,749][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:40:06,466][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:40:07,182][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:40:07,899][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:40:08,617][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:40:09,336][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:40:10,052][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:40:10,771][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:40:11,487][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:40:12,205][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:40:12,923][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:40:13,639][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:40:14,358][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:40:15,075][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:40:15,790][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:40:16,512][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:40:17,775][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:40:17,779][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:40:17,781][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:40:19,151][__main__][INFO] - Iteration 708 took 55s (8.87% Gen, 88.66% Train). Generation: 4s, Training: 49s. Estimated remaining time: 4h 10m 30s. Estimated total time: 15h 26m 15s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 37s, 500 more iterations: 7h 43m 7s. [2026-03-26 01:40:19,153][__main__][INFO] - Starting iteration 708. [2026-03-26 01:40:19,158][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:40:19,158][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:40:25,860][__main__][INFO] - Number of regex retries in iteration 708: 0 [2026-03-26 01:40:25,861][__main__][INFO] - agents played in iteration 708 are Bob, Alice [2026-03-26 01:40:26,370][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:40:26,434][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:40:26,435][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:40:26,435][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:40:27,113][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:40:27,760][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:40:28,477][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:40:29,189][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:40:29,903][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:40:30,618][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:40:31,331][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:40:32,047][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:40:32,762][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:40:33,477][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:40:34,191][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:40:34,905][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:40:35,622][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:40:36,337][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:40:37,052][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:40:37,767][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:40:38,484][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:40:39,199][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:40:39,919][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:40:40,634][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:40:41,351][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:40:42,065][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:40:42,781][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:40:43,497][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:40:44,213][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:40:44,929][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:40:45,644][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:40:46,362][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:40:47,077][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:40:47,794][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:40:48,510][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:40:49,226][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:40:49,942][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:40:50,660][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:40:51,376][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:40:52,092][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:40:52,807][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:40:53,525][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:40:54,242][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:40:54,960][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:40:55,676][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:40:56,394][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:40:57,110][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:40:57,829][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:40:58,545][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:40:59,264][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:40:59,979][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:41:00,698][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:41:01,640][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:41:02,358][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:41:03,074][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:41:03,793][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:41:04,510][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:41:05,228][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:41:05,945][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:41:06,668][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:41:07,385][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:41:08,105][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:41:08,822][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:41:09,540][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:41:10,258][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:41:10,975][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:41:11,694][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:41:12,409][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:41:13,127][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:41:13,855][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:41:14,842][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:41:14,844][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:41:14,846][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:41:17,008][__main__][INFO] - Iteration 709 took 57s (11.59% Gen, 84.67% Train). Generation: 6s, Training: 48s. Estimated remaining time: 4h 47m 29s. Estimated total time: 16h 4m 12s. Time estimates for 10 more iterations: 9m 38s, 100 more iterations: 1h 36m 25s, 500 more iterations: 8h 2m 6s. [2026-03-26 01:41:17,012][__main__][INFO] - Starting iteration 709. [2026-03-26 01:41:17,017][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:41:17,018][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:41:21,927][__main__][INFO] - Number of regex retries in iteration 709: 0 [2026-03-26 01:41:21,928][__main__][INFO] - agents played in iteration 709 are Bob, Alice [2026-03-26 01:41:22,430][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:41:22,495][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:41:22,496][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:41:22,497][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:41:23,181][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:41:23,825][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:41:24,541][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:41:25,256][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:41:25,971][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:41:26,685][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:41:27,402][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:41:28,116][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:41:28,832][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:41:29,548][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:41:30,263][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:41:30,978][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:41:31,694][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:41:32,409][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:41:33,125][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:41:33,841][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:41:34,557][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:41:35,272][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:41:35,987][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:41:36,705][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:41:37,420][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:41:38,137][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:41:38,853][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:41:39,571][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:41:40,287][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:41:41,003][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:41:41,718][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:41:42,435][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:41:43,153][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:41:43,869][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:41:44,584][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:41:45,302][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:41:46,019][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:41:46,735][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:41:47,453][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:41:48,169][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:41:48,887][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:41:49,603][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:41:50,322][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:41:51,038][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:41:51,755][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:41:52,471][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:41:53,189][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:41:53,905][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:41:54,621][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:41:55,339][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:41:56,056][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:41:56,773][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:41:57,732][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:41:58,450][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:41:59,166][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:41:59,884][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:42:00,601][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:42:01,318][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:42:02,037][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:42:02,755][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:42:03,474][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:42:04,192][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:42:04,909][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:42:05,627][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:42:06,343][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:42:07,058][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:42:07,775][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:42:08,491][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:42:09,208][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:42:09,972][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:42:11,238][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:42:11,243][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:42:11,245][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:42:12,529][__main__][INFO] - Iteration 710 took 55s (8.84% Gen, 88.84% Train). Generation: 4s, Training: 49s. Estimated remaining time: 4h 7m 35s. Estimated total time: 15h 25m 13s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 31s, 500 more iterations: 7h 42m 36s. [2026-03-26 01:42:12,532][__main__][INFO] - Starting iteration 710. [2026-03-26 01:42:12,536][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:42:12,536][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:42:17,512][__main__][INFO] - Number of regex retries in iteration 710: 0 [2026-03-26 01:42:17,513][__main__][INFO] - agents played in iteration 710 are Bob, Alice [2026-03-26 01:42:18,011][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:42:18,078][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:42:18,078][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:42:18,079][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:42:18,768][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:42:19,412][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:42:20,130][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:42:20,846][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:42:21,561][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:42:22,274][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:42:22,991][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:42:23,706][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:42:24,420][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:42:25,136][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:42:25,851][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:42:26,566][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:42:27,281][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:42:27,997][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:42:28,713][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:42:29,429][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:42:30,144][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:42:30,861][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:42:31,576][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:42:32,294][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:42:33,007][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:42:33,723][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:42:34,440][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:42:35,156][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:42:35,872][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:42:36,587][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:42:37,305][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:42:38,020][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:42:38,737][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:42:39,455][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:42:40,174][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:42:40,889][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:42:41,606][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:42:42,323][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:42:43,038][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:42:43,756][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:42:44,471][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:42:45,189][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:42:45,905][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:42:46,624][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:42:47,340][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:42:48,058][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:42:48,775][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:42:49,494][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:42:50,213][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:42:50,929][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:42:51,652][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:42:52,367][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:42:53,351][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:42:54,069][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:42:54,784][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:42:55,502][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:42:56,218][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:42:56,936][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:42:57,653][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:42:58,371][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:42:59,087][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:42:59,804][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:43:00,519][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:43:01,236][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:43:01,951][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:43:02,669][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:43:03,385][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:43:04,102][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:43:04,819][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:43:05,558][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:43:06,705][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:43:06,709][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:43:06,710][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:43:08,244][__main__][INFO] - Iteration 711 took 55s (8.93% Gen, 88.31% Train). Generation: 4s, Training: 49s. Estimated remaining time: 4h 9m 56s. Estimated total time: 15h 28m 30s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 51s, 500 more iterations: 7h 44m 15s. [2026-03-26 01:43:08,248][__main__][INFO] - Starting iteration 711. [2026-03-26 01:43:08,252][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:43:08,253][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:43:13,482][__main__][INFO] - Number of regex retries in iteration 711: 0 [2026-03-26 01:43:13,483][__main__][INFO] - agents played in iteration 711 are Bob, Alice [2026-03-26 01:43:14,294][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:43:14,360][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:43:14,361][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:43:14,362][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:43:15,046][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:43:15,691][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:43:16,408][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:43:17,123][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:43:17,836][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:43:18,551][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:43:19,266][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:43:19,980][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:43:20,697][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:43:21,412][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:43:22,126][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:43:22,841][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:43:23,557][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:43:24,272][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:43:24,987][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:43:25,702][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:43:26,418][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:43:27,133][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:43:27,850][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:43:28,565][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:43:29,281][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:43:29,996][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:43:30,712][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:43:31,428][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:43:32,144][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:43:32,860][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:43:33,576][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:43:34,293][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:43:35,009][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:43:35,727][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:43:36,441][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:43:37,161][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:43:37,878][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:43:38,594][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:43:39,311][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:43:40,029][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:43:40,746][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:43:41,461][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:43:42,180][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:43:42,895][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:43:43,614][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:43:44,329][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:43:45,048][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:43:45,764][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:43:46,483][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:43:47,199][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:43:47,917][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:43:48,635][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:43:49,586][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:43:50,302][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:43:51,018][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:43:51,734][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:43:52,450][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:43:53,166][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:43:53,883][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:43:54,600][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:43:55,316][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:43:56,032][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:43:56,747][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:43:57,464][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:43:58,180][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:43:58,897][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:43:59,613][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:44:00,330][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:44:01,047][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:44:01,772][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:44:03,154][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:44:03,159][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:44:03,161][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:44:05,160][__main__][INFO] - Iteration 712 took 56s (9.19% Gen, 87.29% Train). Generation: 5s, Training: 49s. Estimated remaining time: 4h 28m 58s. Estimated total time: 15h 48m 29s. Time estimates for 10 more iterations: 9m 29s, 100 more iterations: 1h 34m 50s, 500 more iterations: 7h 54m 14s. [2026-03-26 01:44:05,163][__main__][INFO] - Starting iteration 712. [2026-03-26 01:44:05,167][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:44:05,168][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:44:10,183][__main__][INFO] - Number of regex retries in iteration 712: 0 [2026-03-26 01:44:10,184][__main__][INFO] - agents played in iteration 712 are Bob, Alice [2026-03-26 01:44:10,686][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:44:10,751][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:44:10,752][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:44:10,752][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:44:11,440][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:44:12,086][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:44:12,801][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:44:13,516][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:44:14,230][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:44:14,945][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:44:15,660][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:44:16,375][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:44:17,092][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:44:17,806][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:44:18,521][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:44:19,237][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:44:19,952][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:44:20,668][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:44:21,384][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:44:22,100][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:44:22,815][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:44:23,531][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:44:24,249][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:44:24,965][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:44:25,681][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:44:26,397][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:44:27,112][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:44:27,832][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:44:28,546][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:44:29,263][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:44:29,979][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:44:30,695][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:44:31,413][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:44:32,128][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:44:32,846][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:44:33,562][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:44:34,279][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:44:34,996][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:44:35,713][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:44:36,429][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:44:37,147][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:44:37,863][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:44:38,581][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:44:39,300][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:44:40,015][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:44:40,730][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:44:41,447][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:44:42,164][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:44:42,880][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:44:43,596][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:44:44,312][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:44:45,029][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:44:45,987][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:44:46,705][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:44:47,420][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:44:48,138][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:44:48,853][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:44:49,571][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:44:50,286][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:44:51,004][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:44:51,721][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:44:52,439][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:44:53,156][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:44:53,873][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:44:54,588][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:44:55,305][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:44:56,022][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:44:56,738][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:44:57,455][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:44:58,219][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:44:59,300][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:44:59,304][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:44:59,306][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:45:01,577][__main__][INFO] - Iteration 713 took 56s (8.89% Gen, 87.08% Train). Generation: 5s, Training: 49s. Estimated remaining time: 4h 19m 44s. Estimated total time: 15h 40m 11s. Time estimates for 10 more iterations: 9m 24s, 100 more iterations: 1h 34m 1s, 500 more iterations: 7h 50m 5s. [2026-03-26 01:45:01,581][__main__][INFO] - Starting iteration 713. [2026-03-26 01:45:01,588][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:45:01,589][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:45:06,707][__main__][INFO] - Number of regex retries in iteration 713: 0 [2026-03-26 01:45:06,708][__main__][INFO] - agents played in iteration 713 are Bob, Alice [2026-03-26 01:45:07,281][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:45:07,344][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:45:07,345][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:45:07,346][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:45:08,032][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:45:08,678][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:45:09,395][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:45:10,110][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:45:10,825][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:45:11,540][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:45:12,255][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:45:12,971][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:45:13,691][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:45:14,410][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:45:15,131][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:45:15,894][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:45:16,591][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:45:17,310][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:45:18,030][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:45:18,750][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:45:19,470][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:45:20,190][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:45:20,910][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:45:21,629][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:45:22,350][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:45:23,069][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:45:23,790][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:45:24,509][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:45:25,228][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:45:25,947][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:45:26,667][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:45:27,385][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:45:28,107][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:45:28,826][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:45:29,545][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:45:30,266][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:45:30,985][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:45:31,704][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:45:32,426][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:45:33,142][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:45:33,861][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:45:34,577][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:45:35,291][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:45:36,008][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:45:36,724][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:45:37,441][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:45:38,157][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:45:38,873][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:45:39,590][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:45:40,306][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:45:41,024][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:45:41,739][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:45:42,725][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:45:43,442][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:45:44,157][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:45:44,874][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:45:45,590][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:45:46,308][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:45:47,024][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:45:47,742][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:45:48,458][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:45:49,175][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:45:49,891][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:45:50,609][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:45:51,325][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:45:52,043][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:45:52,758][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:45:53,476][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:45:54,193][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:45:54,953][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:45:56,027][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:45:56,030][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:45:56,031][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:45:57,457][__main__][INFO] - Iteration 714 took 55s (9.16% Gen, 88.28% Train). Generation: 5s, Training: 49s. Estimated remaining time: 4h 9m 49s. Estimated total time: 15h 31m 12s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 7s, 500 more iterations: 7h 45m 36s. [2026-03-26 01:45:57,468][__main__][INFO] - Starting iteration 714. [2026-03-26 01:45:57,479][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:45:57,479][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:46:02,700][__main__][INFO] - Number of regex retries in iteration 714: 0 [2026-03-26 01:46:02,701][__main__][INFO] - agents played in iteration 714 are Bob, Alice [2026-03-26 01:46:03,298][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:46:03,363][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:46:03,364][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:46:03,365][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:46:04,060][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:46:04,704][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:46:05,424][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:46:06,140][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:46:06,856][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:46:07,571][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:46:08,288][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:46:09,002][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:46:09,719][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:46:10,435][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:46:11,155][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:46:11,870][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:46:12,585][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:46:13,300][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:46:14,019][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:46:14,733][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:46:15,450][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:46:16,167][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:46:16,883][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:46:17,599][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:46:18,316][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:46:19,032][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:46:19,749][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:46:20,465][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:46:21,183][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:46:21,898][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:46:22,616][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:46:23,332][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:46:24,052][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:46:24,768][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:46:25,485][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:46:26,201][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:46:26,920][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:46:27,636][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:46:28,355][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:46:29,070][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:46:29,787][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:46:30,502][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:46:31,220][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:46:31,934][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:46:32,651][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:46:33,368][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:46:34,085][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:46:34,800][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:46:35,516][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:46:36,234][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:46:36,948][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:46:37,665][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:46:38,615][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:46:39,331][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:46:40,047][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:46:40,763][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:46:41,479][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:46:42,196][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:46:42,914][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:46:43,632][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:46:44,348][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:46:45,066][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:46:45,783][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:46:46,499][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:46:47,216][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:46:47,933][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:46:48,650][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:46:49,367][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:46:50,082][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:46:50,833][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:46:51,969][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:46:51,972][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:46:51,973][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:46:53,337][__main__][INFO] - Iteration 715 took 55s (9.35% Gen, 88.21% Train). Generation: 5s, Training: 49s. Estimated remaining time: 4h 8m 40s. Estimated total time: 15h 30m 59s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 5s, 500 more iterations: 7h 45m 29s. [2026-03-26 01:46:53,339][__main__][INFO] - Starting iteration 715. [2026-03-26 01:46:53,343][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:46:53,344][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:46:58,330][__main__][INFO] - Number of regex retries in iteration 715: 0 [2026-03-26 01:46:58,332][__main__][INFO] - agents played in iteration 715 are Bob, Alice [2026-03-26 01:46:58,841][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:46:58,906][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:46:58,908][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:46:58,909][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:46:59,598][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:47:00,243][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:47:00,960][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:47:01,674][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:47:02,390][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:47:03,105][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:47:03,822][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:47:04,537][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:47:05,254][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:47:05,968][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:47:06,683][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:47:07,399][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:47:08,117][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:47:08,833][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:47:09,549][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:47:10,265][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:47:10,980][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:47:11,698][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:47:12,413][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:47:13,131][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:47:13,847][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:47:14,562][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:47:15,279][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:47:15,996][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:47:16,715][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:47:17,431][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:47:18,148][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:47:18,865][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:47:19,584][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:47:20,300][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:47:21,019][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:47:21,734][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:47:22,452][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:47:23,167][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:47:23,884][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:47:24,599][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:47:25,317][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:47:26,033][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:47:26,751][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:47:27,468][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:47:28,185][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:47:28,902][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:47:29,618][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:47:30,334][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:47:31,051][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:47:31,768][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:47:32,486][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:47:33,202][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:47:34,167][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:47:34,884][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:47:35,600][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:47:36,317][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:47:37,033][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:47:37,751][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:47:38,467][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:47:39,186][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:47:39,902][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:47:40,622][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:47:41,337][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:47:42,055][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:47:42,773][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:47:43,491][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:47:44,208][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:47:44,926][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:47:45,642][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:47:46,424][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:47:47,853][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:47:47,859][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:47:47,861][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:47:49,179][__main__][INFO] - Iteration 716 took 55s (8.93% Gen, 88.70% Train). Generation: 4s, Training: 49s. Estimated remaining time: 4h 7m 22s. Estimated total time: 15h 30m 37s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 3s, 500 more iterations: 7h 45m 18s. [2026-03-26 01:47:49,183][__main__][INFO] - Starting iteration 716. [2026-03-26 01:47:49,190][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:47:49,192][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:47:54,356][__main__][INFO] - Number of regex retries in iteration 716: 0 [2026-03-26 01:47:54,357][__main__][INFO] - agents played in iteration 716 are Bob, Alice [2026-03-26 01:47:55,158][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:47:55,224][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:47:55,225][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:47:55,226][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:47:55,911][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:47:56,555][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:47:57,273][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:47:57,988][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:47:58,702][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:47:59,418][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:48:00,134][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:48:00,851][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:48:01,565][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:48:02,282][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:48:02,997][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:48:03,713][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:48:04,427][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:48:05,143][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:48:05,859][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:48:06,574][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:48:07,292][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:48:08,006][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:48:08,724][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:48:09,441][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:48:10,157][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:48:10,873][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:48:11,589][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:48:12,304][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:48:13,024][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:48:13,739][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:48:14,455][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:48:15,172][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:48:15,889][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:48:16,606][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:48:17,322][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:48:18,041][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:48:18,757][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:48:19,474][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:48:20,191][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:48:20,908][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:48:21,624][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:48:22,340][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:48:23,057][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:48:23,773][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:48:24,488][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:48:25,205][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:48:25,920][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:48:26,638][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:48:27,354][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:48:28,070][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:48:28,786][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:48:29,503][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:48:30,474][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:48:31,193][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:48:31,909][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:48:32,626][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:48:33,341][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:48:34,059][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:48:34,776][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:48:35,493][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:48:36,209][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:48:36,927][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:48:37,643][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:48:38,361][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:48:39,078][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:48:39,796][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:48:40,511][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:48:41,229][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:48:41,946][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:48:42,666][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:48:43,911][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:48:43,916][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:48:43,920][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:48:45,279][__main__][INFO] - Iteration 717 took 56s (9.21% Gen, 88.36% Train). Generation: 5s, Training: 49s. Estimated remaining time: 4h 10m 41s. Estimated total time: 15h 34m 52s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 29s, 500 more iterations: 7h 47m 26s. [2026-03-26 01:48:45,281][__main__][INFO] - Starting iteration 717. [2026-03-26 01:48:45,285][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:48:45,286][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:48:50,296][__main__][INFO] - Number of regex retries in iteration 717: 0 [2026-03-26 01:48:50,298][__main__][INFO] - agents played in iteration 717 are Bob, Alice [2026-03-26 01:48:50,798][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:48:50,864][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:48:50,865][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:48:50,866][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:48:51,547][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:48:52,193][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:48:52,909][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:48:53,624][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:48:54,339][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:48:55,053][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:48:55,768][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:48:56,485][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:48:57,199][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:48:57,914][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:48:58,631][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:48:59,347][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:49:00,064][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:49:00,779][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:49:01,494][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:49:02,211][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:49:02,928][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:49:03,645][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:49:04,360][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:49:05,078][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:49:05,794][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:49:06,511][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:49:07,227][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:49:07,943][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:49:08,663][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:49:09,380][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:49:10,098][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:49:10,814][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:49:11,531][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:49:12,247][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:49:12,964][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:49:13,682][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:49:14,397][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:49:15,117][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:49:15,832][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:49:16,550][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:49:17,266][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:49:17,983][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:49:18,699][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:49:19,417][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:49:20,133][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:49:20,850][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:49:21,565][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:49:22,283][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:49:23,000][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:49:23,716][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:49:24,434][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:49:25,149][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:49:26,091][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:49:26,809][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:49:27,526][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:49:28,243][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:49:28,960][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:49:29,676][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:49:30,394][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:49:31,110][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:49:31,826][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:49:32,543][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:49:33,260][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:49:33,977][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:49:34,693][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:49:35,410][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:49:36,127][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:49:36,843][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:49:37,560][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:49:38,286][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:49:39,604][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:49:39,608][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:49:39,610][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:49:43,216][__main__][INFO] - Iteration 718 took 57s (8.65% Gen, 85.12% Train). Generation: 5s, Training: 49s. Estimated remaining time: 4h 40m 23s. Estimated total time: 16h 5m 32s. Time estimates for 10 more iterations: 9m 39s, 100 more iterations: 1h 36m 33s, 500 more iterations: 8h 2m 46s. [2026-03-26 01:49:43,218][__main__][INFO] - Starting iteration 718. [2026-03-26 01:49:43,222][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:49:43,222][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:49:48,243][__main__][INFO] - Number of regex retries in iteration 718: 0 [2026-03-26 01:49:48,244][__main__][INFO] - agents played in iteration 718 are Bob, Alice [2026-03-26 01:49:48,747][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:49:48,813][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:49:48,814][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:49:48,815][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:49:49,506][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:49:50,152][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:49:50,870][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:49:51,582][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:49:52,297][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:49:53,011][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:49:53,725][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:49:54,440][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:49:55,154][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:49:55,868][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:49:56,584][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:49:57,299][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:49:58,014][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:49:58,730][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:49:59,444][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:50:00,161][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:50:00,877][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:50:01,592][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:50:02,309][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:50:03,023][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:50:03,741][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:50:04,456][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:50:05,171][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:50:05,888][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:50:06,604][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:50:07,321][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:50:08,036][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:50:08,754][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:50:09,469][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:50:10,185][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:50:10,902][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:50:11,619][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:50:12,335][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:50:13,052][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:50:13,768][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:50:14,486][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:50:15,202][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:50:15,920][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:50:16,636][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:50:17,355][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:50:18,072][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:50:18,789][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:50:19,506][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:50:20,222][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:50:20,940][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:50:21,657][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:50:22,375][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:50:23,092][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:50:24,057][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:50:24,776][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:50:25,492][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:50:26,210][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:50:26,927][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:50:27,644][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:50:28,362][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:50:29,079][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:50:29,796][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:50:30,513][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:50:31,229][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:50:31,948][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:50:32,665][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:50:33,384][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:50:34,103][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:50:34,819][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:50:35,537][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:50:36,304][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:50:37,624][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:50:37,628][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:50:37,630][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:50:39,143][__main__][INFO] - Iteration 719 took 55s (8.98% Gen, 88.31% Train). Generation: 5s, Training: 49s. Estimated remaining time: 4h 5m 58s. Estimated total time: 15h 32m 3s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 12s, 500 more iterations: 7h 46m 1s. [2026-03-26 01:50:39,146][__main__][INFO] - Starting iteration 719. [2026-03-26 01:50:39,150][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:50:39,151][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:50:44,091][__main__][INFO] - Number of regex retries in iteration 719: 0 [2026-03-26 01:50:44,092][__main__][INFO] - agents played in iteration 719 are Bob, Alice [2026-03-26 01:50:44,627][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:50:44,692][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:50:44,693][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:50:44,694][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:50:45,384][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:50:46,029][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:50:46,746][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:50:47,459][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:50:48,174][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:50:48,892][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:50:49,607][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:50:50,326][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:50:51,045][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:50:51,761][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:50:52,479][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:50:53,198][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:50:53,914][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:50:54,629][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:50:55,346][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:50:56,060][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:50:56,777][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:50:57,494][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:50:58,209][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:50:58,925][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:50:59,642][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:51:00,359][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:51:01,076][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:51:01,793][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:51:02,508][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:51:03,227][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:51:03,942][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:51:04,661][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:51:05,376][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:51:06,093][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:51:06,809][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:51:07,524][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:51:08,242][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:51:08,960][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:51:09,678][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:51:10,394][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:51:11,112][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:51:11,828][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:51:12,546][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:51:13,265][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:51:13,981][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:51:14,699][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:51:15,415][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:51:16,134][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:51:16,851][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:51:17,567][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:51:18,286][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:51:19,002][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:51:19,997][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:51:20,716][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:51:21,432][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:51:22,150][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:51:22,866][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:51:23,584][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:51:24,301][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:51:25,018][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:51:25,736][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:51:26,453][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:51:27,171][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:51:27,888][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:51:28,605][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:51:29,321][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:51:30,039][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:51:30,756][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:51:31,474][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:51:32,200][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:51:33,382][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:51:33,385][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:51:33,387][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:51:34,658][__main__][INFO] - Iteration 720 took 55s (8.90% Gen, 88.80% Train). Generation: 4s, Training: 49s. Estimated remaining time: 3h 58m 9s. Estimated total time: 15h 25m 10s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 31s, 500 more iterations: 7h 42m 35s. [2026-03-26 01:51:34,661][__main__][INFO] - Starting iteration 720. [2026-03-26 01:51:34,665][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:51:34,666][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:51:39,652][__main__][INFO] - Number of regex retries in iteration 720: 0 [2026-03-26 01:51:39,653][__main__][INFO] - agents played in iteration 720 are Bob, Alice [2026-03-26 01:51:40,251][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:51:40,317][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:51:40,318][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:51:40,319][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:51:41,005][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:51:41,650][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:51:42,367][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:51:43,082][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:51:43,796][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:51:44,512][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:51:45,228][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:51:45,942][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:51:46,660][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:51:47,375][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:51:48,091][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:51:48,805][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:51:49,522][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:51:50,237][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:51:50,953][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:51:51,669][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:51:52,385][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:51:53,102][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:51:53,817][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:51:54,534][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:51:55,251][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:51:55,967][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:51:56,684][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:51:57,401][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:51:58,118][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:51:58,834][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:51:59,552][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:52:00,267][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:52:00,984][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:52:01,701][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:52:02,419][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:52:03,136][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:52:03,854][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:52:04,570][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:52:05,289][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:52:06,006][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:52:06,724][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:52:07,441][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:52:08,158][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:52:08,875][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:52:09,593][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:52:10,309][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:52:11,027][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:52:11,742][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:52:12,459][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:52:13,176][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:52:13,892][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:52:14,609][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:52:15,551][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:52:16,267][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:52:16,984][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:52:17,702][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:52:18,419][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:52:19,136][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:52:19,852][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:52:20,570][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:52:21,286][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:52:22,004][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:52:22,720][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:52:23,438][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:52:24,154][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:52:24,871][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:52:25,589][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:52:26,306][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:52:27,023][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:52:27,754][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:52:28,735][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:52:28,738][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:52:28,740][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:52:30,200][__main__][INFO] - Iteration 721 took 55s (8.98% Gen, 88.38% Train). Generation: 4s, Training: 49s. Estimated remaining time: 3h 57m 41s. Estimated total time: 15h 25m 37s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 33s, 500 more iterations: 7h 42m 48s. [2026-03-26 01:52:30,203][__main__][INFO] - Starting iteration 721. [2026-03-26 01:52:30,207][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:52:30,208][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:52:35,444][__main__][INFO] - Number of regex retries in iteration 721: 0 [2026-03-26 01:52:35,445][__main__][INFO] - agents played in iteration 721 are Bob, Alice [2026-03-26 01:52:36,133][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:52:36,199][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:52:36,200][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:52:36,200][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:52:36,884][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:52:37,529][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:52:38,245][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:52:38,962][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:52:39,677][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:52:40,394][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:52:41,108][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:52:41,824][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:52:42,539][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:52:43,254][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:52:43,970][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:52:44,686][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:52:45,402][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:52:46,118][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:52:46,835][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:52:47,550][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:52:48,268][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:52:48,983][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:52:49,701][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:52:50,417][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:52:51,134][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:52:51,851][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:52:52,569][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:52:53,285][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:52:54,002][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:52:54,718][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:52:55,435][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:52:56,151][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:52:56,869][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:52:57,585][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:52:58,303][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:52:59,020][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:52:59,737][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:53:00,458][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:53:01,174][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:53:01,892][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:53:02,607][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:53:03,325][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:53:04,041][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:53:04,759][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:53:05,475][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:53:06,194][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:53:06,909][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:53:07,627][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:53:08,342][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:53:09,060][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:53:09,777][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:53:10,494][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:53:11,459][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:53:12,175][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:53:12,892][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:53:13,608][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:53:14,324][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:53:15,040][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:53:15,756][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:53:16,474][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:53:17,189][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:53:17,907][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:53:18,626][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:53:19,344][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:53:20,059][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:53:20,776][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:53:21,495][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:53:22,212][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:53:22,927][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:53:23,695][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:53:24,699][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:53:24,702][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:53:24,704][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:53:26,092][__main__][INFO] - Iteration 722 took 55s (9.37% Gen, 88.14% Train). Generation: 5s, Training: 49s. Estimated remaining time: 4h 2m 35s. Estimated total time: 15h 31m 26s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 8s, 500 more iterations: 7h 45m 43s. [2026-03-26 01:53:26,096][__main__][INFO] - Starting iteration 722. [2026-03-26 01:53:26,103][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:53:26,104][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:53:31,026][__main__][INFO] - Number of regex retries in iteration 722: 0 [2026-03-26 01:53:31,027][__main__][INFO] - agents played in iteration 722 are Bob, Alice [2026-03-26 01:53:31,538][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:53:31,606][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:53:31,607][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:53:31,607][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:53:32,293][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:53:32,938][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:53:33,656][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:53:34,372][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:53:35,087][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:53:35,803][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:53:36,517][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:53:37,234][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:53:37,948][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:53:38,666][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:53:39,382][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:53:40,099][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:53:40,815][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:53:41,532][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:53:42,247][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:53:42,965][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:53:43,681][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:53:44,398][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:53:45,115][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:53:45,832][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:53:46,548][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:53:47,267][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:53:47,983][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:53:48,700][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:53:49,416][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:53:50,133][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:53:50,850][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:53:51,566][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:53:52,280][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:53:52,998][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:53:53,714][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:53:54,431][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:53:55,146][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:53:55,863][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:53:56,579][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:53:57,295][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:53:58,013][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:53:58,728][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:53:59,445][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:54:00,160][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:54:00,877][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:54:01,594][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:54:02,311][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:54:03,028][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:54:03,745][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:54:04,461][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:54:05,179][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:54:05,896][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:54:06,894][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:54:07,612][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:54:08,330][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:54:09,047][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:54:09,765][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:54:10,482][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:54:11,198][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:54:11,914][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:54:12,631][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:54:13,349][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:54:14,065][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:54:14,782][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:54:15,497][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:54:16,216][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:54:16,933][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:54:17,651][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:54:18,368][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:54:19,100][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:54:20,042][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:54:20,044][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:54:20,046][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:54:21,434][__main__][INFO] - Iteration 723 took 55s (8.90% Gen, 88.59% Train). Generation: 4s, Training: 49s. Estimated remaining time: 3h 52m 27s. Estimated total time: 15h 22m 14s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 13s, 500 more iterations: 7h 41m 7s. [2026-03-26 01:54:21,438][__main__][INFO] - Starting iteration 723. [2026-03-26 01:54:21,443][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:54:21,444][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:54:31,919][__main__][INFO] - Number of regex retries in iteration 723: 0 [2026-03-26 01:54:31,920][__main__][INFO] - agents played in iteration 723 are Bob, Alice [2026-03-26 01:54:32,424][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:54:32,489][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:54:32,490][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:54:32,491][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:54:33,172][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:54:33,816][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:54:34,533][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:54:35,245][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:54:35,958][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:54:36,672][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:54:37,385][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:54:38,097][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:54:38,814][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:54:39,527][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:54:40,241][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:54:40,958][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:54:41,671][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:54:42,387][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:54:43,101][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:54:43,815][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:54:44,530][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:54:45,244][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:54:45,961][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:54:46,675][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:54:47,391][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:54:48,105][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:54:48,820][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:54:49,535][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:54:50,249][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:54:50,966][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:54:51,680][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:54:52,396][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:54:53,111][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:54:53,829][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:54:54,543][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:54:55,260][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:54:55,974][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:54:56,691][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:54:57,408][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:54:58,125][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:54:58,841][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:54:59,559][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:55:00,274][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:55:00,992][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:55:01,707][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:55:02,426][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:55:03,142][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:55:03,858][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:55:04,574][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:55:05,291][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:55:06,007][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:55:06,723][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:55:07,670][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:55:08,386][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:55:09,102][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:55:09,820][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:55:10,536][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:55:11,253][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:55:11,968][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:55:12,685][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:55:13,399][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:55:14,116][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:55:14,830][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:55:15,547][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:55:16,264][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:55:16,979][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:55:17,696][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:55:18,412][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:55:19,129][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:55:19,862][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:55:20,860][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:55:20,865][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:55:20,867][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:55:22,243][__main__][INFO] - Iteration 724 took 1m 0s (17.23% Gen, 80.50% Train). Generation: 10s, Training: 48s. Estimated remaining time: 5h 22m 34s. Estimated total time: 16h 53m 21s. Time estimates for 10 more iterations: 10m 8s, 100 more iterations: 1h 41m 20s, 500 more iterations: 8h 26m 40s. [2026-03-26 01:55:22,245][__main__][INFO] - Starting iteration 724. [2026-03-26 01:55:22,249][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:55:22,250][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:55:27,458][__main__][INFO] - Number of regex retries in iteration 724: 0 [2026-03-26 01:55:27,459][__main__][INFO] - agents played in iteration 724 are Bob, Alice [2026-03-26 01:55:27,958][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:55:28,023][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:55:28,024][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:55:28,024][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:55:28,715][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:55:29,360][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:55:30,078][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:55:30,791][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:55:31,505][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:55:32,221][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:55:32,935][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:55:33,652][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:55:34,367][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:55:35,083][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:55:35,799][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:55:36,517][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:55:37,233][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:55:37,947][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:55:38,664][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:55:39,379][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:55:40,096][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:55:40,812][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:55:41,527][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:55:42,243][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:55:42,960][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:55:43,676][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:55:44,393][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:55:45,109][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:55:45,826][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:55:46,540][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:55:47,258][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:55:47,974][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:55:48,691][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:55:49,406][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:55:50,126][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:55:50,841][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:55:51,559][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:55:52,275][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:55:52,992][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:55:53,710][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:55:54,426][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:55:55,143][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:55:55,859][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:55:56,575][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:55:57,292][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:55:58,009][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:55:58,725][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:55:59,441][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:56:00,157][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:56:00,874][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:56:01,589][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:56:02,306][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:56:03,274][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:56:03,990][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:56:04,707][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:56:05,422][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:56:06,140][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:56:06,855][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:56:07,573][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:56:08,288][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:56:09,007][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:56:09,725][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:56:10,441][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:56:11,158][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:56:11,874][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:56:12,589][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:56:13,306][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:56:14,023][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:56:14,739][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:56:15,502][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:56:16,460][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:56:16,462][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:56:16,464][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:56:21,001][__main__][INFO] - Iteration 725 took 58s (8.87% Gen, 83.41% Train). Generation: 5s, Training: 49s. Estimated remaining time: 4h 47m 27s. Estimated total time: 16h 19m 13s. Time estimates for 10 more iterations: 9m 47s, 100 more iterations: 1h 37m 55s, 500 more iterations: 8h 9m 36s. [2026-03-26 01:56:21,004][__main__][INFO] - Starting iteration 725. [2026-03-26 01:56:21,008][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:56:21,009][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:56:25,935][__main__][INFO] - Number of regex retries in iteration 725: 0 [2026-03-26 01:56:25,936][__main__][INFO] - agents played in iteration 725 are Bob, Alice [2026-03-26 01:56:26,439][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:56:26,503][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:56:26,504][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:56:26,504][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:56:27,188][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:56:27,832][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:56:28,548][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:56:29,260][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:56:29,975][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:56:30,689][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:56:31,404][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:56:32,119][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:56:32,833][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:56:33,547][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:56:34,264][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:56:34,978][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:56:35,693][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:56:36,407][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:56:37,122][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:56:37,838][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:56:38,552][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:56:39,268][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:56:39,983][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:56:40,698][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:56:41,414][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:56:42,129][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:56:42,844][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:56:43,560][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:56:44,275][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:56:44,991][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:56:45,707][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:56:46,423][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:56:47,141][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:56:47,855][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:56:48,573][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:56:49,288][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:56:50,004][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:56:50,723][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:56:51,438][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:56:52,156][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:56:52,870][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:56:53,588][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:56:54,305][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:56:55,021][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:56:55,739][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:56:56,456][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:56:57,176][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:56:57,892][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:56:58,610][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:56:59,326][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:57:00,045][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:57:00,761][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:57:01,756][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:57:02,475][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:57:03,191][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:57:03,909][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:57:04,627][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:57:05,343][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:57:06,060][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:57:06,776][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:57:07,493][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:57:08,208][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:57:08,926][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:57:09,642][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:57:10,360][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:57:11,076][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:57:11,794][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:57:12,509][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:57:13,226][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:57:13,951][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:57:15,310][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:57:15,315][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:57:15,317][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:57:16,710][__main__][INFO] - Iteration 726 took 55s (8.85% Gen, 88.65% Train). Generation: 4s, Training: 49s. Estimated remaining time: 3h 55m 41s. Estimated total time: 15h 28m 23s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 50s, 500 more iterations: 7h 44m 11s. [2026-03-26 01:57:16,715][__main__][INFO] - Starting iteration 726. [2026-03-26 01:57:16,719][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:57:16,720][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:57:21,626][__main__][INFO] - Number of regex retries in iteration 726: 0 [2026-03-26 01:57:21,627][__main__][INFO] - agents played in iteration 726 are Bob, Alice [2026-03-26 01:57:22,136][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:57:22,201][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:57:22,202][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:57:22,202][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:57:22,889][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:57:23,535][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:57:24,250][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:57:24,965][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:57:25,679][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:57:26,395][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:57:27,110][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:57:27,824][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:57:28,540][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:57:29,256][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:57:29,972][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:57:30,686][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:57:31,404][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:57:32,119][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:57:32,835][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:57:33,551][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:57:34,267][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:57:34,983][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:57:35,700][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:57:36,415][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:57:37,131][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:57:37,846][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:57:38,562][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:57:39,279][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:57:39,995][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:57:40,712][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:57:41,430][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:57:42,147][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:57:42,863][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:57:43,580][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:57:44,296][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:57:45,014][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:57:45,731][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:57:46,449][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:57:47,166][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:57:47,883][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:57:48,600][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:57:49,318][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:57:50,035][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:57:50,754][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:57:51,470][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:57:52,189][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:57:52,906][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:57:53,625][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:57:54,344][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:57:55,062][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:57:55,779][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:57:56,496][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:57:57,444][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:57:58,164][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:57:58,879][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:57:59,596][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:58:00,312][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:58:01,030][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:58:01,746][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:58:02,463][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:58:03,179][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:58:03,897][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:58:04,614][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:58:05,331][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:58:06,048][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:58:06,765][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:58:07,482][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:58:08,198][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:58:08,916][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:58:09,639][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:58:10,595][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:58:10,598][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:58:10,599][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:58:12,115][__main__][INFO] - Iteration 727 took 55s (8.86% Gen, 88.40% Train). Generation: 4s, Training: 48s. Estimated remaining time: 3h 49m 40s. Estimated total time: 15h 23m 17s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 19s, 500 more iterations: 7h 41m 38s. [2026-03-26 01:58:12,119][__main__][INFO] - Starting iteration 727. [2026-03-26 01:58:12,124][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:58:12,125][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:58:17,104][__main__][INFO] - Number of regex retries in iteration 727: 0 [2026-03-26 01:58:17,105][__main__][INFO] - agents played in iteration 727 are Bob, Alice [2026-03-26 01:58:17,652][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:58:17,719][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:58:17,720][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:58:17,721][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:58:18,402][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:58:19,046][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:58:19,762][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:58:20,478][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:58:21,192][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:58:21,909][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:58:22,623][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:58:23,340][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:58:24,056][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:58:24,774][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:58:25,492][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:58:26,209][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:58:26,924][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:58:27,640][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:58:28,357][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:58:29,074][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:58:29,790][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:58:30,507][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:58:31,222][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:58:31,940][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:58:32,655][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:58:33,373][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:58:34,089][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:58:34,806][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:58:35,522][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:58:36,240][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:58:36,955][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:58:37,675][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:58:38,391][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:58:39,108][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:58:39,824][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:58:40,541][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:58:41,257][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:58:41,975][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:58:42,691][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:58:43,407][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:58:44,122][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:58:44,840][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:58:45,556][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:58:46,273][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:58:46,989][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:58:47,707][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:58:48,422][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:58:49,142][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:58:49,858][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:58:50,577][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:58:51,295][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:58:52,014][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:58:52,992][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:58:53,710][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:58:54,429][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:58:55,146][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:58:55,865][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:58:56,583][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:58:57,302][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:58:58,020][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:58:58,738][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:58:59,458][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 01:59:00,176][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 01:59:00,897][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 01:59:01,617][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 01:59:02,337][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 01:59:03,057][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 01:59:03,776][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 01:59:04,497][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 01:59:05,301][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 01:59:06,327][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 01:59:06,331][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 01:59:06,332][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 01:59:07,652][__main__][INFO] - Iteration 728 took 55s (8.97% Gen, 88.65% Train). Generation: 4s, Training: 49s. Estimated remaining time: 3h 50m 57s. Estimated total time: 15h 25m 30s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 33s, 500 more iterations: 7h 42m 45s. [2026-03-26 01:59:07,655][__main__][INFO] - Starting iteration 728. [2026-03-26 01:59:07,659][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 01:59:07,660][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 01:59:12,690][__main__][INFO] - Number of regex retries in iteration 728: 0 [2026-03-26 01:59:12,691][__main__][INFO] - agents played in iteration 728 are Bob, Alice [2026-03-26 01:59:13,301][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:59:16,936][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 01:59:17,270][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 01:59:17,273][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 01:59:18,020][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 01:59:19,507][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 01:59:20,224][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 01:59:20,938][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 01:59:21,654][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 01:59:22,370][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 01:59:23,084][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 01:59:23,797][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 01:59:24,511][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 01:59:25,225][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 01:59:25,939][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 01:59:26,654][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 01:59:27,368][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 01:59:28,081][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 01:59:28,795][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 01:59:29,510][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 01:59:30,227][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 01:59:30,941][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 01:59:31,659][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 01:59:32,375][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 01:59:33,090][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 01:59:33,806][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 01:59:34,524][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 01:59:35,239][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 01:59:35,957][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 01:59:36,672][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 01:59:37,388][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 01:59:38,107][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 01:59:38,823][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 01:59:39,538][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 01:59:40,256][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 01:59:40,974][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 01:59:41,689][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 01:59:42,406][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 01:59:43,122][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 01:59:43,838][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 01:59:44,553][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 01:59:45,268][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 01:59:45,984][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 01:59:46,700][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 01:59:47,416][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 01:59:48,132][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 01:59:48,848][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 01:59:49,564][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 01:59:50,282][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 01:59:50,998][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 01:59:51,717][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 01:59:52,432][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 01:59:53,430][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 01:59:54,147][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 01:59:54,864][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 01:59:55,580][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
01:59:56,298][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 01:59:57,013][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 01:59:57,731][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 01:59:58,448][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 01:59:59,165][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 01:59:59,881][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:00:00,598][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:00:01,313][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:00:02,030][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:00:02,747][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 02:00:03,462][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:00:04,180][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:00:04,895][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 02:00:05,623][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:47 [2026-03-26 02:00:06,769][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:00:06,773][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:00:06,775][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:00:08,079][__main__][INFO] - Iteration 729 took 1m 0s (8.33% Gen, 89.51% Train). Generation: 5s, Training: 54s. Estimated remaining time: 5h 11m 28s. Estimated total time: 16h 47m 1s. Time estimates for 10 more iterations: 10m 4s, 100 more iterations: 1h 40m 42s, 500 more iterations: 8h 23m 30s. [2026-03-26 02:00:08,082][__main__][INFO] - Starting iteration 729. [2026-03-26 02:00:08,088][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 02:00:08,089][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:00:13,495][__main__][INFO] - Number of regex retries in iteration 729: 0 [2026-03-26 02:00:13,496][__main__][INFO] - agents played in iteration 729 are Bob, Alice [2026-03-26 02:00:14,011][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 02:00:14,076][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 02:00:14,077][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:00:14,077][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 02:00:14,758][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:00:15,404][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:00:16,119][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:00:16,833][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:00:17,547][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:00:18,262][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:00:18,977][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:00:19,691][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:00:20,407][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:00:21,121][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:00:21,835][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 02:00:22,551][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:00:23,266][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:00:23,981][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:00:24,696][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:00:25,413][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:00:26,128][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:00:26,844][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:00:27,559][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:00:28,276][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:00:28,991][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:00:29,706][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:00:30,423][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:00:31,139][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:00:31,855][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:00:32,571][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:00:33,287][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:00:34,004][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:00:34,720][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:00:35,436][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:00:36,153][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:00:36,869][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:00:37,587][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:00:38,303][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:00:39,021][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:00:39,739][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:00:40,455][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:00:41,172][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:00:41,889][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:00:42,605][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:00:43,319][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 02:00:44,036][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:00:44,752][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:00:45,468][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:00:46,185][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:00:46,903][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:00:47,619][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 02:00:48,337][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 02:00:49,280][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 02:00:49,997][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 02:00:50,713][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 02:00:51,429][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
02:00:52,144][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 02:00:52,862][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 02:00:53,578][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 02:00:54,294][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 02:00:55,011][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 02:00:55,726][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:00:56,444][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:00:57,160][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:00:57,878][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:00:58,594][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 02:00:59,312][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:01:00,028][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:01:00,746][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 02:01:01,468][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 02:01:02,562][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:01:02,564][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:01:02,566][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:01:04,015][__main__][INFO] - Iteration 730 took 55s (9.67% Gen, 87.74% Train). Generation: 5s, Training: 49s. Estimated remaining time: 3h 55m 39s. Estimated total time: 15h 32m 9s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 12s, 500 more iterations: 7h 46m 4s. [2026-03-26 02:01:04,019][__main__][INFO] - Starting iteration 730. [2026-03-26 02:01:04,025][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 02:01:04,027][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:01:09,082][__main__][INFO] - Number of regex retries in iteration 730: 0 [2026-03-26 02:01:09,083][__main__][INFO] - agents played in iteration 730 are Bob, Alice [2026-03-26 02:01:09,587][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 02:01:09,655][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 02:01:09,655][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:01:09,656][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 02:01:10,339][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:01:10,983][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:01:11,699][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:01:12,415][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:01:13,129][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:01:13,843][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:01:14,560][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:01:15,275][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:01:15,992][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:01:16,706][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:01:17,422][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 02:01:18,138][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:01:18,853][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:01:19,571][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:01:20,286][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:01:21,002][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:01:21,719][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:01:22,435][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:01:23,152][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:01:23,867][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:01:24,584][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:01:25,303][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:01:26,019][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:01:26,735][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:01:27,453][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:01:28,169][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:01:28,886][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:01:29,603][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:01:30,321][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:01:31,038][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:01:31,756][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:01:32,472][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:01:33,189][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:01:33,906][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:01:34,621][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:01:35,338][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:01:36,053][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:01:36,771][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:01:37,485][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:01:38,204][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:01:38,919][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 02:01:39,636][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:01:40,354][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:01:41,070][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:01:41,788][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:01:42,505][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:01:43,221][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 02:01:43,941][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 02:01:44,903][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 02:01:45,620][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 02:01:46,338][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 02:01:47,052][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
02:01:47,769][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 02:01:48,486][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 02:01:49,202][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 02:01:49,918][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 02:01:50,635][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 02:01:51,352][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:01:52,069][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:01:52,787][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:01:53,504][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:01:54,222][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 02:01:54,939][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:01:55,656][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:01:56,374][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 02:01:57,146][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 02:01:58,125][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:01:58,127][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:01:58,128][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:01:59,453][__main__][INFO] - Iteration 731 took 55s (9.12% Gen, 88.48% Train). Generation: 5s, Training: 49s. Estimated remaining time: 3h 46m 25s. Estimated total time: 15h 23m 50s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 23s, 500 more iterations: 7h 41m 55s. [2026-03-26 02:01:59,457][__main__][INFO] - Starting iteration 731. [2026-03-26 02:01:59,461][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 02:01:59,462][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:02:04,446][__main__][INFO] - Number of regex retries in iteration 731: 0 [2026-03-26 02:02:04,447][__main__][INFO] - agents played in iteration 731 are Bob, Alice [2026-03-26 02:02:04,954][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 02:02:05,018][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 02:02:05,019][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:02:05,019][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 02:02:05,751][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:02:06,398][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:02:07,117][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:02:07,833][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:02:08,549][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:02:09,265][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:02:09,983][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:02:10,697][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:02:11,415][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:02:12,129][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:02:12,846][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 02:02:13,562][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:02:14,277][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:02:14,995][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:02:15,712][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:02:16,429][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:02:17,144][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:02:17,863][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:02:18,579][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:02:19,296][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:02:20,012][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:02:20,730][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:02:21,446][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:02:22,164][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:02:22,880][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:02:23,597][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:02:24,316][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:02:25,033][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:02:25,752][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:02:26,468][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:02:27,186][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:02:27,903][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:02:28,623][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:02:29,338][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:02:30,056][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:02:30,773][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:02:31,490][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:02:32,207][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:02:32,925][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:02:33,640][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:02:34,359][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 02:02:35,073][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:02:35,791][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:02:36,507][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:02:37,224][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:02:37,940][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:02:38,658][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 02:02:39,376][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 02:02:40,374][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 02:02:41,092][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 02:02:41,808][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 02:02:42,526][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
02:02:43,242][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 02:02:43,959][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 02:02:44,676][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 02:02:45,394][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 02:02:46,110][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 02:02:46,827][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:02:47,543][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:02:48,260][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:02:48,978][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:02:49,694][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 02:02:50,413][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:02:51,129][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:02:51,848][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 02:02:52,571][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 02:02:54,007][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:02:54,011][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:02:54,013][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:02:55,449][__main__][INFO] - Iteration 732 took 55s (8.90% Gen, 88.53% Train). Generation: 4s, Training: 49s. Estimated remaining time: 3h 54m 48s. Estimated total time: 15h 33m 9s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 18s, 500 more iterations: 7h 46m 34s. [2026-03-26 02:02:55,452][__main__][INFO] - Starting iteration 732. [2026-03-26 02:02:55,456][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 02:02:55,456][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:03:00,382][__main__][INFO] - Number of regex retries in iteration 732: 0 [2026-03-26 02:03:00,383][__main__][INFO] - agents played in iteration 732 are Bob, Alice [2026-03-26 02:03:00,888][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 02:03:00,953][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 02:03:00,954][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:03:00,955][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 02:03:01,650][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:03:02,296][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:03:03,013][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:03:03,729][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:03:04,443][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:03:05,161][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:03:05,876][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:03:06,592][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:03:07,307][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:03:08,025][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:03:08,739][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 02:03:09,457][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:03:10,171][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:03:10,891][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:03:11,607][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:03:12,324][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:03:13,041][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:03:13,758][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:03:14,475][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:03:15,192][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:03:15,907][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:03:16,627][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:03:17,343][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:03:18,065][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:03:18,783][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:03:19,500][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:03:20,217][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:03:20,934][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:03:21,653][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:03:22,370][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:03:23,092][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:03:23,809][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:03:24,527][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:03:25,244][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:03:25,962][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:03:26,679][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:03:27,397][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:03:28,113][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:03:28,832][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:03:29,549][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:03:30,268][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 02:03:30,985][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:03:31,705][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:03:32,422][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:03:33,141][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:03:33,857][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:03:34,575][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 02:03:35,293][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 02:03:36,249][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 02:03:36,967][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 02:03:37,684][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 02:03:38,403][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
02:03:39,122][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 02:03:39,839][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 02:03:40,558][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 02:03:41,277][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 02:03:41,994][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 02:03:42,712][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:03:43,430][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:03:44,150][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:03:44,867][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:03:45,587][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 02:03:46,306][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:03:47,023][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:03:47,742][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 02:03:48,487][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 02:03:49,463][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:03:49,466][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:03:49,467][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:03:50,829][__main__][INFO] - Iteration 733 took 55s (8.90% Gen, 88.64% Train). Generation: 4s, Training: 49s. Estimated remaining time: 3h 43m 38s. Estimated total time: 15h 22m 55s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 17s, 500 more iterations: 7h 41m 27s. [2026-03-26 02:03:50,831][__main__][INFO] - Starting iteration 733. [2026-03-26 02:03:50,835][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 02:03:50,836][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:03:57,741][__main__][INFO] - Number of regex retries in iteration 733: 0 [2026-03-26 02:03:57,742][__main__][INFO] - agents played in iteration 733 are Bob, Alice [2026-03-26 02:03:58,256][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 02:03:58,320][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 02:03:58,321][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:03:58,322][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 02:03:59,013][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:03:59,657][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:04:00,374][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:04:01,089][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:04:01,803][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:04:02,518][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:04:03,234][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:04:03,948][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:04:04,665][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:04:05,379][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:04:06,096][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 02:04:06,810][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:04:07,529][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:04:08,242][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:04:08,960][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:04:09,674][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:04:10,391][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:04:11,106][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:04:11,822][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:04:12,538][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:04:13,253][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:04:13,969][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:04:14,684][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:04:15,400][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:04:16,117][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:04:16,834][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:04:17,549][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:04:18,266][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:04:18,982][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:04:19,700][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:04:20,414][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:04:21,134][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:04:21,849][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:04:22,567][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:04:23,283][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:04:24,001][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:04:24,717][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:04:25,435][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:04:26,150][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:04:26,869][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:04:27,586][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 02:04:28,306][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:04:29,023][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:04:29,742][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:04:30,458][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:04:31,177][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:04:31,893][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 02:04:32,612][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 02:04:33,558][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 02:04:34,275][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 02:04:34,992][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 02:04:35,710][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
02:04:36,429][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 02:04:37,144][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 02:04:37,862][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 02:04:38,577][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 02:04:39,294][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 02:04:40,013][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:04:40,728][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:04:41,447][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:04:42,164][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:04:42,881][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 02:04:43,598][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:04:44,316][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:04:45,032][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 02:04:45,778][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 02:04:46,975][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:04:46,980][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:04:46,982][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:04:48,427][__main__][INFO] - Iteration 734 took 57s (11.99% Gen, 85.50% Train). Generation: 6s, Training: 49s. Estimated remaining time: 4h 19m 39s. Estimated total time: 15h 59m 53s. Time estimates for 10 more iterations: 9m 35s, 100 more iterations: 1h 35m 59s, 500 more iterations: 7h 59m 56s. [2026-03-26 02:04:48,464][__main__][INFO] - Starting iteration 734. [2026-03-26 02:04:48,468][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 02:04:48,469][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:04:53,869][__main__][INFO] - Number of regex retries in iteration 734: 0 [2026-03-26 02:04:53,870][__main__][INFO] - agents played in iteration 734 are Bob, Alice [2026-03-26 02:04:54,375][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 02:04:54,442][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 02:04:54,443][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:04:54,443][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 02:04:55,133][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:04:55,776][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:04:56,493][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:04:57,208][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:04:57,924][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:04:58,640][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:04:59,354][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:05:00,069][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:05:00,783][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:05:01,498][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:05:02,215][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 02:05:02,929][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:05:03,646][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:05:04,364][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:05:05,081][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:05:05,798][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:05:06,516][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:05:07,231][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:05:07,950][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:05:08,667][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:05:09,384][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:05:10,103][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:05:10,824][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:05:11,539][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:05:12,257][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:05:12,975][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:05:13,693][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:05:14,411][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:05:15,131][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:05:15,848][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:05:16,566][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:05:17,285][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:05:18,002][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:05:18,722][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:05:19,438][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:05:20,157][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:05:20,876][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:05:21,594][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:05:22,317][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:05:23,033][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:05:23,749][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 02:05:24,467][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:05:25,185][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:05:25,903][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:05:26,620][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:05:27,338][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:05:28,055][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 02:05:28,772][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 02:05:29,818][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 02:05:30,535][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 02:05:31,252][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 02:05:31,969][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
02:05:32,687][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 02:05:33,404][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 02:05:34,121][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 02:05:34,839][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 02:05:35,556][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 02:05:36,275][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:05:36,991][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:05:37,711][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:05:38,429][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:05:39,147][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 02:05:39,866][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:05:40,585][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:05:41,302][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 02:05:42,066][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 02:05:43,042][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:05:43,045][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:05:43,047][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:05:44,558][__main__][INFO] - Iteration 735 took 56s (9.63% Gen, 87.67% Train). Generation: 5s, Training: 49s. Estimated remaining time: 3h 53m 41s. Estimated total time: 15h 34m 51s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 29s, 500 more iterations: 7h 47m 25s. [2026-03-26 02:05:44,561][__main__][INFO] - Starting iteration 735. [2026-03-26 02:05:44,566][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 02:05:44,566][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:05:49,699][__main__][INFO] - Number of regex retries in iteration 735: 0 [2026-03-26 02:05:49,700][__main__][INFO] - agents played in iteration 735 are Bob, Alice [2026-03-26 02:05:50,330][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 02:05:50,396][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 02:05:50,397][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:05:50,397][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 02:05:51,108][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:05:51,755][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:05:52,475][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:05:53,191][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:05:53,909][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:05:54,626][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:05:55,343][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:05:56,061][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:05:56,781][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:05:57,497][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:05:58,218][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 02:05:58,935][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:05:59,655][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:06:00,374][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:06:01,092][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:06:01,812][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:06:02,530][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:06:03,249][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:06:03,968][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:06:04,687][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:06:05,407][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:06:06,123][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:06:06,844][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:06:07,565][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:06:08,284][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:06:09,004][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:06:09,725][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:06:10,445][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:06:11,163][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:06:11,886][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:06:12,605][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:06:13,321][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:06:14,038][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:06:14,755][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:06:15,471][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:06:16,186][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:06:16,904][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:06:17,619][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:06:18,336][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:06:19,052][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:06:19,768][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 02:06:20,484][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:06:21,200][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:06:21,916][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:06:22,631][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:06:23,346][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:06:24,063][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 02:06:24,780][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 02:06:25,724][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 02:06:26,440][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 02:06:27,156][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 02:06:27,871][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
02:06:28,588][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 02:06:29,303][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 02:06:30,021][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 02:06:30,736][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 02:06:31,453][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 02:06:32,170][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:06:32,886][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:06:33,603][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:06:34,319][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:06:35,036][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 02:06:35,752][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:06:36,470][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:06:37,186][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 02:06:37,921][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 02:06:38,910][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:06:38,912][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:06:38,914][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:06:40,305][__main__][INFO] - Iteration 736 took 55s (9.21% Gen, 88.29% Train). Generation: 5s, Training: 49s. Estimated remaining time: 3h 46m 55s. Estimated total time: 15h 29m 1s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 54s, 500 more iterations: 7h 44m 30s. [2026-03-26 02:06:40,309][__main__][INFO] - Starting iteration 736. [2026-03-26 02:06:40,313][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 02:06:40,313][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:06:45,255][__main__][INFO] - Number of regex retries in iteration 736: 0 [2026-03-26 02:06:45,256][__main__][INFO] - agents played in iteration 736 are Bob, Alice [2026-03-26 02:06:45,773][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 02:06:45,838][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 02:06:45,839][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:06:45,840][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 02:06:46,560][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:06:47,206][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:06:47,923][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:06:48,639][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:06:49,353][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:06:50,068][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:06:50,784][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:06:51,498][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:06:52,213][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:06:52,929][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:06:53,646][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 02:06:54,361][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:06:55,078][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:06:55,792][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:06:56,510][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:06:57,226][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:06:57,943][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:06:58,658][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:06:59,376][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:07:00,092][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:07:00,811][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:07:01,527][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:07:02,245][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:07:02,959][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:07:03,678][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:07:04,393][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:07:05,110][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:07:05,827][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:07:06,545][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:07:07,261][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:07:07,977][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:07:08,695][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:07:09,412][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:07:10,129][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:07:10,845][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:07:11,563][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:07:12,279][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:07:12,997][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:07:13,713][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:07:14,431][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:07:15,147][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 02:07:15,866][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:07:16,583][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:07:17,299][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:07:18,019][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:07:18,738][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:07:19,456][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 02:07:20,174][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 02:07:21,133][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 02:07:21,853][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 02:07:22,571][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 02:07:23,289][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
02:07:24,005][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 02:07:24,721][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 02:07:25,437][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 02:07:26,155][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 02:07:26,871][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 02:07:27,589][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:07:28,305][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:07:29,022][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:07:29,739][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:07:30,455][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 02:07:31,172][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:07:31,890][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:07:32,606][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 02:07:33,336][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 02:07:34,440][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:07:34,444][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:07:34,450][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:07:35,807][__main__][INFO] - Iteration 737 took 55s (8.91% Gen, 88.64% Train). Generation: 4s, Training: 49s. Estimated remaining time: 3h 41m 55s. Estimated total time: 15h 24m 56s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 29s, 500 more iterations: 7h 42m 28s. [2026-03-26 02:07:35,810][__main__][INFO] - Starting iteration 737. [2026-03-26 02:07:35,815][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 02:07:35,816][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:07:40,797][__main__][INFO] - Number of regex retries in iteration 737: 0 [2026-03-26 02:07:40,798][__main__][INFO] - agents played in iteration 737 are Bob, Alice [2026-03-26 02:07:41,303][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 02:07:41,369][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 02:07:41,369][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:07:41,370][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 02:07:42,076][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:07:42,722][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:07:43,437][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:07:44,152][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:07:44,867][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:07:45,582][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:07:46,298][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:07:47,012][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:07:47,729][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:07:48,443][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:07:49,158][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 02:07:49,876][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:07:50,592][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:07:51,309][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:07:52,024][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:07:52,741][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:07:53,457][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:07:54,172][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:07:54,889][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:07:55,605][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:07:56,323][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:07:57,038][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:07:57,756][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:07:58,471][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:07:59,190][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:07:59,905][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:08:00,622][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:08:01,339][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:08:02,057][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:08:02,775][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:08:03,492][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:08:04,209][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:08:04,925][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:08:05,644][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:08:06,360][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:08:07,076][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:08:07,792][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:08:08,509][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:08:09,226][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:08:09,943][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:08:10,660][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 02:08:11,377][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:08:12,093][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:08:12,810][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:08:13,526][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:08:14,244][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:08:14,963][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 02:08:15,680][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 02:08:16,700][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 02:08:17,419][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 02:08:18,135][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 02:08:18,851][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
02:08:19,569][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 02:08:20,285][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 02:08:21,003][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 02:08:21,720][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 02:08:22,438][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 02:08:23,154][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:08:23,872][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:08:24,589][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:08:25,305][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:08:26,020][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 02:08:26,736][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:08:27,453][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:08:28,170][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 02:08:28,926][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 02:08:29,966][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:08:29,970][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:08:29,972][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:08:31,297][__main__][INFO] - Iteration 738 took 55s (8.98% Gen, 88.62% Train). Generation: 4s, Training: 49s. Estimated remaining time: 3h 40m 48s. Estimated total time: 15h 24m 45s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 28s, 500 more iterations: 7h 42m 22s. [2026-03-26 02:08:31,300][__main__][INFO] - Starting iteration 738. [2026-03-26 02:08:31,304][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 02:08:31,305][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:08:36,307][__main__][INFO] - Number of regex retries in iteration 738: 0 [2026-03-26 02:08:36,308][__main__][INFO] - agents played in iteration 738 are Bob, Alice [2026-03-26 02:08:36,805][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 02:08:36,869][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 02:08:36,870][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:08:36,871][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 02:08:37,553][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:08:38,198][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:08:38,918][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:08:39,631][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:08:40,346][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:08:41,062][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:08:41,776][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:08:42,494][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:08:43,210][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:08:43,926][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:08:44,641][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 02:08:45,357][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:08:46,073][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:08:46,789][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:08:47,505][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:08:48,220][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:08:48,937][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:08:49,653][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:08:50,369][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:08:51,087][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:08:51,801][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:08:52,521][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:08:53,236][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:08:53,954][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:08:54,669][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:08:55,389][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:08:56,106][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:08:56,823][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:08:57,540][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:08:58,258][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:08:58,973][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:08:59,693][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:09:00,408][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:09:01,127][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:09:01,843][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:09:02,562][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:09:03,279][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:09:03,997][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:09:04,712][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:09:05,431][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:09:06,147][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 02:09:06,863][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:09:07,579][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:09:08,296][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:09:09,013][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:09:09,729][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:09:10,445][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 02:09:11,161][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 02:09:12,106][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 02:09:12,822][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 02:09:13,539][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 02:09:14,255][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
02:09:14,972][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 02:09:15,689][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 02:09:16,406][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 02:09:17,124][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 02:09:17,841][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 02:09:18,557][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:09:19,276][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:09:19,992][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:09:20,711][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:09:21,429][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 02:09:22,145][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:09:22,864][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:09:23,580][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 02:09:24,320][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 02:09:25,259][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:09:25,262][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:09:25,263][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:09:26,721][__main__][INFO] - Iteration 739 took 55s (9.03% Gen, 88.33% Train). Generation: 5s, Training: 48s. Estimated remaining time: 3h 38m 47s. Estimated total time: 15h 23m 39s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 21s, 500 more iterations: 7h 41m 49s. [2026-03-26 02:09:26,726][__main__][INFO] - Starting iteration 739. [2026-03-26 02:09:26,751][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 02:09:26,752][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:09:32,180][__main__][INFO] - Number of regex retries in iteration 739: 0 [2026-03-26 02:09:32,181][__main__][INFO] - agents played in iteration 739 are Bob, Alice [2026-03-26 02:09:32,696][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 02:09:32,761][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 02:09:32,762][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:09:32,762][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 02:09:33,446][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:09:34,091][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:09:34,809][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:09:35,524][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:09:36,244][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:09:36,962][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:09:37,679][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:09:38,396][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:09:39,113][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:09:39,830][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:09:40,543][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 02:09:41,259][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:09:41,976][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:09:42,692][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:09:43,407][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:09:44,123][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:09:44,839][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:09:45,555][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:09:46,271][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:09:46,987][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:09:47,705][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:09:48,420][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:09:49,136][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:09:49,853][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:09:50,568][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:09:51,285][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:09:52,003][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:09:52,718][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:09:53,437][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:09:54,153][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:09:54,872][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:09:55,587][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:09:56,305][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:09:57,022][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:09:57,740][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:09:58,457][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:09:59,174][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:09:59,891][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:10:00,608][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:10:01,325][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:10:02,042][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 02:10:02,758][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:10:03,475][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:10:04,191][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:10:04,908][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:10:05,624][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:10:06,342][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 02:10:07,059][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 02:10:08,002][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 02:10:08,720][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 02:10:09,438][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 02:10:10,153][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
02:10:10,871][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 02:10:11,587][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 02:10:12,305][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 02:10:13,021][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 02:10:13,741][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 02:10:14,458][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:10:15,176][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:10:15,893][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:10:16,611][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:10:17,328][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 02:10:18,044][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:10:18,761][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:10:19,478][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 02:10:20,225][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 02:10:21,190][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:10:21,193][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:10:21,194][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:10:22,534][__main__][INFO] - Iteration 740 took 55s (9.73% Gen, 87.86% Train). Generation: 5s, Training: 49s. Estimated remaining time: 3h 43m 56s. Estimated total time: 15h 29m 45s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 58s, 500 more iterations: 7h 44m 52s. [2026-03-26 02:10:22,537][__main__][INFO] - Starting iteration 740. [2026-03-26 02:10:22,543][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 02:10:22,543][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:10:27,467][__main__][INFO] - Number of regex retries in iteration 740: 0 [2026-03-26 02:10:27,468][__main__][INFO] - agents played in iteration 740 are Bob, Alice [2026-03-26 02:10:27,970][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 02:10:28,035][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 02:10:28,036][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:10:28,037][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 02:10:28,735][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:10:29,380][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:10:30,098][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:10:30,813][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:10:31,529][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:10:32,244][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:10:32,961][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:10:33,676][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:10:34,393][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:10:35,108][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:10:35,826][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 02:10:36,542][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:10:37,260][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:10:37,973][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:10:38,692][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:10:39,407][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:10:40,125][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:10:40,840][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:10:41,558][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:10:42,274][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:10:42,993][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:10:43,708][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:10:44,425][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:10:45,141][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:10:45,857][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:10:46,574][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:10:47,290][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:10:48,007][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:10:48,723][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:10:49,442][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:10:50,157][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:10:50,876][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:10:51,592][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:10:52,309][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:10:53,025][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:10:53,742][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:10:54,458][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:10:55,174][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:10:55,889][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:10:56,607][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:10:57,322][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 02:10:58,039][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:10:58,755][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:10:59,473][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:11:00,189][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:11:00,908][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:11:01,624][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 02:11:02,341][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 02:11:03,367][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 02:11:04,082][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 02:11:04,799][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 02:11:05,517][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
02:11:06,233][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 02:11:06,949][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 02:11:07,667][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 02:11:08,383][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 02:11:09,102][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 02:11:09,819][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:11:10,537][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:11:11,253][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:11:11,972][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:11:12,689][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 02:11:13,406][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:11:14,122][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:11:14,840][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 02:11:15,597][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 02:11:16,534][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:11:16,536][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:11:16,538][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:11:17,884][__main__][INFO] - Iteration 741 took 55s (8.90% Gen, 88.66% Train). Generation: 4s, Training: 49s. Estimated remaining time: 3h 35m 40s. Estimated total time: 15h 22m 23s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 14s, 500 more iterations: 7h 41m 11s. [2026-03-26 02:11:17,887][__main__][INFO] - Starting iteration 741. [2026-03-26 02:11:17,891][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 02:11:17,891][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:11:23,028][__main__][INFO] - Number of regex retries in iteration 741: 0 [2026-03-26 02:11:23,029][__main__][INFO] - agents played in iteration 741 are Bob, Alice [2026-03-26 02:11:23,531][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 02:11:23,595][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 02:11:23,596][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:11:23,597][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 02:11:24,284][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:11:24,930][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:11:25,648][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:11:26,362][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:11:27,077][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:11:27,791][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:11:28,506][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:11:29,222][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:11:29,937][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:11:30,653][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:11:31,369][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 02:11:32,086][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:11:32,801][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:11:33,518][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:11:34,234][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:11:34,950][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:11:35,666][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:11:36,384][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:11:37,099][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:11:37,817][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:11:38,533][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:11:39,251][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:11:39,968][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:11:40,684][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:11:41,402][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:11:42,118][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:11:42,834][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:11:43,549][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:11:44,264][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:11:44,980][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:11:45,697][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:11:46,414][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:11:47,130][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:11:47,845][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:11:48,561][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:11:49,281][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:11:49,998][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:11:50,714][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:11:51,432][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:11:52,147][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:11:52,865][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 02:11:53,582][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:11:54,299][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:11:55,019][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:11:55,735][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:11:56,453][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:11:57,168][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 02:11:57,886][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 02:11:58,829][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 02:11:59,547][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 02:12:00,264][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 02:12:00,980][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
02:12:01,699][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 02:12:02,415][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 02:12:03,134][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 02:12:03,849][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 02:12:04,567][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 02:12:05,283][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:12:06,006][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:12:06,725][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:12:07,442][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:12:08,160][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 02:12:08,877][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:12:09,594][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:12:10,311][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 02:12:11,036][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 02:12:12,053][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:12:12,057][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:12:12,059][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:12:13,578][__main__][INFO] - Iteration 742 took 55s (9.23% Gen, 88.04% Train). Generation: 5s, Training: 49s. Estimated remaining time: 3h 40m 29s. Estimated total time: 15h 28m 8s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 48s, 500 more iterations: 7h 44m 4s. [2026-03-26 02:12:13,581][__main__][INFO] - Starting iteration 742. [2026-03-26 02:12:13,585][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 02:12:13,586][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:12:18,521][__main__][INFO] - Number of regex retries in iteration 742: 0 [2026-03-26 02:12:18,522][__main__][INFO] - agents played in iteration 742 are Bob, Alice [2026-03-26 02:12:19,048][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 02:12:19,115][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 02:12:19,116][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:12:19,116][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 02:12:19,797][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:12:20,443][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:12:21,162][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:12:21,877][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:12:22,594][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:12:23,309][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:12:24,026][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:12:24,741][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:12:25,457][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:12:26,175][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:12:26,891][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 02:12:27,607][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:12:28,323][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:12:29,040][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:12:29,756][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:12:30,471][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:12:31,188][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:12:31,906][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:12:32,623][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:12:33,339][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:12:34,057][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:12:34,774][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:12:35,492][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:12:36,211][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:12:36,930][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:12:37,646][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:12:38,365][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:12:39,084][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:12:39,801][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:12:40,519][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:12:41,236][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:12:41,955][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:12:42,671][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:12:43,388][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:12:44,106][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:12:44,823][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:12:45,539][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:12:46,257][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:12:46,973][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:12:47,690][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:12:48,406][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 02:12:49,123][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:12:49,839][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:12:50,556][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:12:51,273][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:12:51,990][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:12:52,708][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 02:12:53,424][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 02:12:54,365][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 02:12:55,083][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 02:12:55,800][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 02:12:56,516][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
02:12:57,234][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 02:12:57,950][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 02:12:58,669][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 02:12:59,386][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 02:13:00,104][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 02:13:00,822][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:13:01,540][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:13:02,257][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:13:02,976][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:13:03,693][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 02:13:04,411][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:13:05,127][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:13:05,847][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 02:13:06,567][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 02:13:07,512][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:13:07,515][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:13:07,516][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:13:08,869][__main__][INFO] - Iteration 743 took 55s (8.93% Gen, 88.62% Train). Generation: 4s, Training: 48s. Estimated remaining time: 3h 32m 51s. Estimated total time: 15h 21m 26s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 8s, 500 more iterations: 7h 40m 43s. [2026-03-26 02:13:08,871][__main__][INFO] - Starting iteration 743. [2026-03-26 02:13:08,875][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 02:13:08,875][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:13:13,821][__main__][INFO] - Number of regex retries in iteration 743: 0 [2026-03-26 02:13:13,822][__main__][INFO] - agents played in iteration 743 are Bob, Alice [2026-03-26 02:13:14,416][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 02:13:14,482][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 02:13:14,482][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:13:14,483][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 02:13:15,180][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:13:15,827][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:13:16,544][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:13:17,260][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:13:17,977][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:13:18,690][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:13:19,407][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:13:20,123][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:13:20,840][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:13:21,555][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:13:22,272][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 02:13:22,988][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:13:23,705][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:13:24,421][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:13:25,138][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:13:25,855][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:13:26,571][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:13:27,288][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:13:28,006][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:13:28,722][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:13:29,439][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:13:30,155][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:13:30,873][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:13:31,590][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:13:32,307][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:13:33,025][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:13:33,741][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:13:34,461][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:13:35,178][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:13:35,897][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:13:36,613][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:13:37,333][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:13:38,051][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:13:38,768][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:13:39,486][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:13:40,202][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:13:40,919][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:13:41,635][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:13:42,352][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:13:43,068][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:13:43,784][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 02:13:44,500][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:13:45,217][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:13:45,934][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:13:46,651][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:13:47,367][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:13:48,083][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 02:13:48,802][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 02:13:49,804][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 02:13:50,522][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 02:13:51,242][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 02:13:51,959][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
02:13:52,676][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 02:13:53,394][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 02:13:54,112][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 02:13:54,829][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 02:13:55,545][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 02:13:56,262][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:13:56,979][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:13:57,697][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:13:58,415][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:13:59,133][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 02:13:59,850][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:14:00,569][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:14:01,285][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 02:14:02,041][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 02:14:02,989][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:14:02,991][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:14:02,993][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:14:04,299][__main__][INFO] - Iteration 744 took 55s (8.92% Gen, 88.71% Train). Generation: 4s, Training: 49s. Estimated remaining time: 3h 34m 16s. Estimated total time: 15h 23m 46s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 22s, 500 more iterations: 7h 41m 53s. [2026-03-26 02:14:04,306][__main__][INFO] - Starting iteration 744. [2026-03-26 02:14:04,310][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 02:14:04,311][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:14:10,684][__main__][INFO] - Number of regex retries in iteration 744: 0 [2026-03-26 02:14:10,685][__main__][INFO] - agents played in iteration 744 are Bob, Alice [2026-03-26 02:14:11,520][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 02:14:11,587][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 02:14:11,588][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:14:11,589][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 02:14:12,270][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:14:12,916][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:14:13,633][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:14:14,347][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:14:15,063][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:14:15,778][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:14:16,492][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:14:17,206][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:14:17,922][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:14:18,638][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:14:19,352][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 02:14:20,069][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:14:20,784][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:14:21,500][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:14:22,216][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:14:22,932][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:14:23,648][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:14:24,365][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:14:25,080][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:14:25,795][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:14:26,512][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:14:27,227][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:14:27,944][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:14:28,661][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:14:29,377][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:14:30,092][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:14:30,810][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:14:31,525][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:14:32,244][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:14:32,960][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:14:33,678][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:14:34,394][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:14:35,111][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:14:35,827][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:14:36,545][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:14:37,262][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:14:37,980][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:14:38,697][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:14:39,415][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:14:40,131][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:14:40,849][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 02:14:41,566][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:14:42,286][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:14:43,003][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:14:43,721][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:14:44,436][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:14:45,152][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 02:14:45,871][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 02:14:46,819][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 02:14:47,536][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 02:14:48,251][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 02:14:48,968][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
02:14:49,684][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 02:14:50,401][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 02:14:51,117][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 02:14:51,834][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 02:14:52,550][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 02:14:53,267][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:14:53,983][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:14:54,700][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:14:55,417][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:14:56,133][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 02:14:56,850][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:14:57,568][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:14:58,285][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 02:14:59,016][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 02:15:00,058][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:15:00,061][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:15:00,063][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:15:01,554][__main__][INFO] - Iteration 745 took 57s (11.13% Gen, 86.26% Train). Generation: 6s, Training: 49s. Estimated remaining time: 4h 3m 38s. Estimated total time: 15h 54m 5s. Time estimates for 10 more iterations: 9m 32s, 100 more iterations: 1h 35m 24s, 500 more iterations: 7h 57m 2s. [2026-03-26 02:15:01,556][__main__][INFO] - Starting iteration 745. [2026-03-26 02:15:01,560][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 02:15:01,560][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:15:06,572][__main__][INFO] - Number of regex retries in iteration 745: 0 [2026-03-26 02:15:06,573][__main__][INFO] - agents played in iteration 745 are Bob, Alice [2026-03-26 02:15:07,078][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 02:15:07,143][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 02:15:07,144][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:15:07,145][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 02:15:07,836][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:15:08,568][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:15:09,286][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:15:10,002][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:15:10,718][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:15:11,432][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:15:12,149][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:15:12,863][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:15:13,579][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:15:14,294][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:15:15,010][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 02:15:15,725][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:15:16,442][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:15:17,156][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:15:17,873][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:15:18,589][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:15:19,305][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:15:20,021][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:15:20,737][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:15:21,453][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:15:22,168][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:15:22,887][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:15:23,603][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:15:24,319][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:15:25,035][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:15:25,750][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:15:26,468][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:15:27,184][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:15:27,902][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:15:28,619][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:15:29,338][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:15:30,053][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:15:30,771][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:15:31,488][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:15:32,206][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:15:32,922][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:15:33,640][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:15:34,355][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:15:35,072][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:15:35,789][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:15:36,506][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 02:15:37,221][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:15:37,939][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:15:38,655][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:15:39,373][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:15:40,089][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:15:40,808][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 02:15:41,526][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 02:15:42,482][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 02:15:43,201][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 02:15:43,920][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 02:15:44,641][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
02:15:45,360][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 02:15:46,076][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 02:15:46,795][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 02:15:47,514][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 02:15:48,232][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 02:15:48,951][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:15:49,669][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:15:50,388][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:15:51,105][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:15:51,824][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 02:15:52,541][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:15:53,259][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:15:53,977][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 02:15:54,739][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 02:15:55,736][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:15:55,738][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:15:55,740][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:15:57,130][__main__][INFO] - Iteration 746 took 55s (9.02% Gen, 88.47% Train). Generation: 5s, Training: 49s. Estimated remaining time: 3h 34m 48s. Estimated total time: 15h 26m 11s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 37s, 500 more iterations: 7h 43m 5s. [2026-03-26 02:15:57,132][__main__][INFO] - Starting iteration 746. [2026-03-26 02:15:57,137][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 02:15:57,138][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:16:02,131][__main__][INFO] - Number of regex retries in iteration 746: 0 [2026-03-26 02:16:02,132][__main__][INFO] - agents played in iteration 746 are Bob, Alice [2026-03-26 02:16:02,645][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 02:16:02,711][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 02:16:02,712][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:16:02,712][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 02:16:03,414][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:16:04,060][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:16:04,777][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:16:05,495][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:16:06,212][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:16:06,929][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:16:07,644][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:16:08,363][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:16:09,080][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:16:09,798][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:16:10,516][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 02:16:11,233][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:16:11,952][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:16:12,669][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:16:13,388][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:16:14,105][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:16:14,823][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:16:15,542][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:16:16,259][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:16:16,978][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:16:17,696][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:16:18,413][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:16:19,132][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:16:19,850][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:16:20,569][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:16:21,288][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:16:22,006][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:16:22,724][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:16:23,442][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:16:24,160][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:16:24,877][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:16:25,595][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:16:26,313][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:16:27,031][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:16:27,749][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:16:28,467][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:16:29,185][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:16:29,902][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:16:30,620][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:16:31,338][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:16:32,056][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 02:16:32,774][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:16:33,491][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:16:34,213][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:16:34,931][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:16:35,648][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:16:36,366][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 02:16:37,084][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 02:16:38,052][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 02:16:38,773][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 02:16:39,488][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 02:16:40,206][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
02:16:40,925][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 02:16:41,644][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 02:16:42,363][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 02:16:43,080][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 02:16:43,800][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 02:16:44,517][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:16:45,236][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:16:45,956][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:16:46,674][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:16:47,392][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 02:16:48,111][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:16:48,829][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:16:49,549][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 02:16:50,329][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 02:16:51,425][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:16:51,429][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:16:51,430][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:16:52,782][__main__][INFO] - Iteration 747 took 55s (8.98% Gen, 88.59% Train). Generation: 4s, Training: 49s. Estimated remaining time: 3h 35m 9s. Estimated total time: 15h 27m 27s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 44s, 500 more iterations: 7h 43m 43s. [2026-03-26 02:16:52,784][__main__][INFO] - Starting iteration 747. [2026-03-26 02:16:52,791][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 02:16:52,791][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:16:59,749][__main__][INFO] - Number of regex retries in iteration 747: 0 [2026-03-26 02:16:59,750][__main__][INFO] - agents played in iteration 747 are Bob, Alice [2026-03-26 02:17:00,263][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 02:17:00,328][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 02:17:00,329][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:17:00,330][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 02:17:01,037][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:17:01,681][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:17:02,400][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:17:03,117][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:17:03,833][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:17:04,551][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:17:05,268][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:17:05,986][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:17:06,703][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:17:07,422][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:17:08,139][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 02:17:08,858][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:17:09,575][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:17:10,292][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:17:11,010][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:17:11,725][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:17:12,445][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:17:13,162][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:17:13,879][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:17:14,596][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:17:15,314][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:17:16,031][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:17:16,746][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:17:17,466][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:17:18,182][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:17:18,900][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:17:19,616][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:17:20,333][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:17:21,051][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:17:21,767][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:17:22,487][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:17:23,204][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:17:23,923][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:17:24,642][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:17:25,359][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:17:26,080][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:17:26,797][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:17:27,515][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:17:28,232][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:17:28,949][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:17:29,667][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 02:17:30,383][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:17:31,102][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:17:31,822][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:17:32,539][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:17:33,257][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:17:33,974][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 02:17:34,691][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 02:17:35,672][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 02:17:36,389][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 02:17:37,106][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 02:17:37,823][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
02:17:38,541][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 02:17:39,260][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 02:17:39,979][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 02:17:40,696][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 02:17:41,412][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 02:17:42,129][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:17:42,845][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:17:43,563][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:17:44,279][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:17:44,996][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 02:17:45,712][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:17:46,430][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:17:47,146][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 02:17:47,869][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 02:17:48,919][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:17:48,923][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:17:48,925][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:17:50,284][__main__][INFO] - Iteration 748 took 57s (12.10% Gen, 85.53% Train). Generation: 6s, Training: 49s. Estimated remaining time: 4h 4m 59s. Estimated total time: 15h 58m 15s. Time estimates for 10 more iterations: 9m 34s, 100 more iterations: 1h 35m 49s, 500 more iterations: 7h 59m 7s. [2026-03-26 02:17:50,286][__main__][INFO] - Starting iteration 748. [2026-03-26 02:17:50,291][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 02:17:50,292][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:17:55,221][__main__][INFO] - Number of regex retries in iteration 748: 0 [2026-03-26 02:17:55,222][__main__][INFO] - agents played in iteration 748 are Bob, Alice [2026-03-26 02:17:55,727][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 02:17:55,793][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 02:17:55,794][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:17:55,795][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 02:17:56,481][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:17:57,126][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:17:57,844][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:17:58,558][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:17:59,273][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:17:59,988][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:18:00,703][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:18:01,419][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:18:02,134][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:18:02,851][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:18:03,566][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 02:18:04,281][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:18:04,998][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:18:05,713][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:18:06,430][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:18:07,146][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:18:07,862][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:18:08,581][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:18:09,297][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:18:10,014][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:18:10,730][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:18:11,447][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:18:12,162][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:18:12,880][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:18:13,595][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:18:14,314][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:18:15,030][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:18:15,747][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:18:16,463][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:18:17,182][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:18:17,897][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:18:18,616][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:18:19,332][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:18:20,049][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:18:20,764][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:18:21,482][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:18:22,197][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:18:22,913][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:18:23,630][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:18:24,346][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:18:25,062][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 02:18:25,777][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:18:26,495][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:18:27,211][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:18:27,928][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:18:28,645][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:18:29,362][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 02:18:30,078][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 02:18:31,017][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 02:18:31,736][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 02:18:32,452][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 02:18:33,170][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
02:18:33,886][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 02:18:34,603][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 02:18:35,320][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 02:18:36,037][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 02:18:36,753][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 02:18:37,470][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:18:38,185][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:18:38,903][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:18:39,622][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:18:40,339][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 02:18:41,056][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:18:41,775][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:18:42,493][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 02:18:43,225][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 02:18:44,323][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:18:44,328][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:18:44,329][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:18:45,742][__main__][INFO] - Iteration 749 took 55s (8.89% Gen, 88.56% Train). Generation: 4s, Training: 49s. Estimated remaining time: 3h 30m 2s. Estimated total time: 15h 24m 13s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 25s, 500 more iterations: 7h 42m 6s. [2026-03-26 02:18:45,745][__main__][INFO] - Starting iteration 749. [2026-03-26 02:18:45,750][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 02:18:45,750][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:18:51,221][__main__][INFO] - Number of regex retries in iteration 749: 0 [2026-03-26 02:18:51,222][__main__][INFO] - agents played in iteration 749 are Bob, Alice [2026-03-26 02:18:51,720][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 02:18:51,786][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 02:18:51,787][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:18:51,788][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 02:18:52,472][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:18:53,117][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:18:53,835][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:18:54,552][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:18:55,267][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:18:55,982][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:18:56,697][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:18:57,413][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:18:58,130][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:18:58,844][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:18:59,561][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 02:19:00,277][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:19:00,993][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:19:01,709][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:19:02,426][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:19:03,142][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:19:03,858][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:19:04,573][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:19:05,290][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:19:06,006][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:19:06,723][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:19:07,441][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:19:08,156][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:19:08,875][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:19:09,592][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:19:10,310][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:19:11,026][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:19:11,746][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:19:12,461][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:19:13,178][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:19:13,895][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:19:14,614][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:19:15,330][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:19:16,048][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:19:16,765][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:19:17,480][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:19:18,199][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:19:18,916][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:19:19,632][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:19:20,348][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:19:21,066][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 02:19:21,782][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:19:22,499][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:19:23,215][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:19:23,932][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:19:24,650][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:19:25,366][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 02:19:26,083][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 02:19:27,044][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 02:19:27,762][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 02:19:28,479][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 02:19:29,196][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
02:19:29,913][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 02:19:30,631][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 02:19:31,348][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 02:19:32,066][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 02:19:32,782][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 02:19:33,499][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:19:34,215][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:19:34,932][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:19:35,648][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:19:36,366][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 02:19:37,084][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:19:37,801][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:19:38,518][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 02:19:39,286][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 02:19:40,240][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:19:40,242][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:19:40,244][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:19:41,740][__main__][INFO] - Iteration 750 took 55s (9.77% Gen, 87.55% Train). Generation: 5s, Training: 49s. Estimated remaining time: 3h 38m 5s. Estimated total time: 15h 33m 12s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 19s, 500 more iterations: 7h 46m 36s. [2026-03-26 02:19:41,742][__main__][INFO] - Starting iteration 750. [2026-03-26 02:19:41,746][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2026-03-26 02:19:41,747][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:19:46,748][__main__][INFO] - Number of regex retries in iteration 750: 0 [2026-03-26 02:19:46,749][__main__][INFO] - agents played in iteration 750 are Bob, Alice [2026-03-26 02:19:47,272][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 02:19:47,337][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 02:19:47,337][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:19:47,338][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 02:19:48,030][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:19:48,676][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:19:49,393][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:19:50,108][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:19:50,824][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:19:51,540][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:19:52,255][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:19:52,972][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:19:53,687][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:19:54,404][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:19:55,120][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 02:19:55,836][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:19:56,552][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:19:57,270][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:19:57,986][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:19:58,703][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:19:59,421][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:20:00,136][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:20:00,853][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:20:01,570][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:20:02,287][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:20:03,003][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:20:03,721][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:20:04,438][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:20:05,156][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:20:05,872][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:20:06,591][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:20:07,307][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:20:08,026][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:20:08,742][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:20:09,461][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:20:10,186][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:20:10,893][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:20:11,610][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:20:12,327][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:20:13,042][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:20:13,760][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:20:14,476][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:20:15,193][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:20:15,908][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:20:16,625][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 02:20:17,344][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:20:18,059][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:20:18,778][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:20:19,493][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:20:20,213][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:20:20,930][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 02:20:21,647][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 02:20:22,629][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 02:20:23,348][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 02:20:24,064][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 02:20:24,781][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
02:20:25,497][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 02:20:26,214][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 02:20:26,930][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 02:20:27,649][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 02:20:28,366][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 02:20:29,086][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:20:29,802][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:20:30,519][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:20:31,235][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:20:31,953][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 02:20:32,669][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:20:33,387][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:20:34,106][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 02:20:34,831][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 02:20:35,943][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:20:35,947][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:20:35,949][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:20:38,616][__main__][INFO] - Iteration 751 took 56s (8.80% Gen, 86.51% Train). Generation: 5s, Training: 49s. Estimated remaining time: 3h 51m 47s. Estimated total time: 15h 47m 52s. Time estimates for 10 more iterations: 9m 28s, 100 more iterations: 1h 34m 47s, 500 more iterations: 7h 53m 56s. [2026-03-26 02:20:38,619][__main__][INFO] - Starting iteration 751. [2026-03-26 02:20:38,624][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 02:20:38,625][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:20:43,552][__main__][INFO] - Number of regex retries in iteration 751: 0 [2026-03-26 02:20:43,554][__main__][INFO] - agents played in iteration 751 are Bob, Alice [2026-03-26 02:20:44,152][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 02:20:44,216][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 02:20:44,217][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:20:44,217][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 02:20:44,916][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:20:45,562][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:20:46,278][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:20:46,993][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:20:47,706][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:20:48,422][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:20:49,138][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:20:49,851][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:20:50,568][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:20:51,283][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:20:51,997][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 02:20:52,713][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:20:53,427][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:20:54,144][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:20:54,860][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:20:55,577][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:20:56,294][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:20:57,012][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:20:57,729][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:20:58,446][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:20:59,162][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:20:59,878][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:21:00,593][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:21:01,311][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:21:02,026][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:21:02,743][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:21:03,459][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:21:04,174][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:21:04,891][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:21:05,608][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:21:06,325][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:21:07,041][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:21:07,758][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:21:08,474][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:21:09,192][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:21:09,907][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:21:10,626][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:21:11,341][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:21:12,060][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:21:12,776][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:21:13,495][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 02:21:14,211][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:21:14,930][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:21:15,645][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:21:16,365][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:21:17,082][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:21:17,800][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 02:21:18,515][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 02:21:19,456][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 02:21:20,172][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 02:21:20,889][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 02:21:21,606][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
02:21:22,322][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 02:21:23,039][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 02:21:23,758][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 02:21:24,474][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 02:21:25,191][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 02:21:25,907][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:21:26,623][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:21:27,340][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:21:28,057][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:21:28,776][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 02:21:29,492][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:21:30,211][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:21:30,927][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 02:21:31,659][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 02:21:32,743][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:21:32,747][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:21:32,748][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:21:34,207][__main__][INFO] - Iteration 752 took 55s (8.87% Gen, 88.50% Train). Generation: 4s, Training: 49s. Estimated remaining time: 3h 29m 25s. Estimated total time: 15h 26m 25s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 38s, 500 more iterations: 7h 43m 12s. [2026-03-26 02:21:34,211][__main__][INFO] - Starting iteration 752. [2026-03-26 02:21:34,218][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 02:21:34,219][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:21:39,204][__main__][INFO] - Number of regex retries in iteration 752: 0 [2026-03-26 02:21:39,205][__main__][INFO] - agents played in iteration 752 are Bob, Alice [2026-03-26 02:21:39,720][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 02:21:39,785][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 02:21:39,786][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:21:39,786][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 02:21:40,476][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:21:41,121][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:21:41,838][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:21:42,552][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:21:43,268][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:21:43,982][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:21:44,699][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:21:45,414][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:21:46,129][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:21:46,844][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:21:47,560][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 02:21:48,275][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:21:48,991][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:21:49,706][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:21:50,422][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:21:51,138][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:21:51,853][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:21:52,571][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:21:53,285][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:21:54,004][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:21:54,719][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:21:55,437][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:21:56,152][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:21:56,870][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:21:57,586][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:21:58,303][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:21:59,019][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:21:59,736][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:22:00,453][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:22:01,169][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:22:01,887][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:22:02,603][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:22:03,322][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:22:04,038][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:22:04,756][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:22:05,472][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:22:06,190][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:22:06,908][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:22:07,624][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:22:08,343][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:22:09,061][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 02:22:09,779][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:22:10,494][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:22:11,210][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:22:11,926][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:22:12,641][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:22:13,357][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 02:22:14,075][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 02:22:15,037][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 02:22:15,754][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 02:22:16,474][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 02:22:17,188][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
02:22:17,906][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 02:22:18,621][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 02:22:19,339][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 02:22:20,056][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 02:22:20,774][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 02:22:21,490][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:22:22,208][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:22:22,923][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:22:23,642][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:22:24,357][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 02:22:25,076][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:22:25,791][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:22:26,508][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 02:22:27,269][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 02:22:28,505][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:22:28,509][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:22:28,511][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:22:29,838][__main__][INFO] - Iteration 753 took 55s (8.96% Gen, 88.64% Train). Generation: 4s, Training: 49s. Estimated remaining time: 3h 29m 7s. Estimated total time: 15h 27m 3s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 42s, 500 more iterations: 7h 43m 31s. [2026-03-26 02:22:29,841][__main__][INFO] - Starting iteration 753. [2026-03-26 02:22:29,844][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 02:22:29,845][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:22:34,778][__main__][INFO] - Number of regex retries in iteration 753: 0 [2026-03-26 02:22:34,779][__main__][INFO] - agents played in iteration 753 are Bob, Alice [2026-03-26 02:22:35,279][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 02:22:35,343][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 02:22:35,344][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:22:35,345][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 02:22:36,029][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:22:36,676][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:22:37,392][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:22:38,107][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:22:38,823][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:22:39,538][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:22:40,254][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:22:40,970][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:22:41,685][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:22:42,398][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:22:43,116][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 02:22:43,831][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:22:44,548][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:22:45,263][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:22:45,978][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:22:46,694][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:22:47,408][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:22:48,124][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:22:48,840][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:22:49,556][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:22:50,271][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:22:50,988][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:22:51,704][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:22:52,421][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:22:53,137][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:22:53,853][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:22:54,570][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:22:55,285][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:22:56,004][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:22:56,719][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:22:57,436][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:22:58,153][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:22:58,871][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:22:59,586][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:23:00,304][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:23:01,020][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:23:01,738][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:23:02,453][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:23:03,171][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:23:03,888][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:23:04,604][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 02:23:05,322][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:23:06,039][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:23:06,755][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:23:07,472][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:23:08,187][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:23:08,904][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-26 02:23:09,619][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-26 02:23:10,598][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-26 02:23:11,314][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-26 02:23:12,030][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-26 02:23:12,746][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-26 
02:23:13,462][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-26 02:23:14,177][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-26 02:23:14,896][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-26 02:23:15,611][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-26 02:23:16,329][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-26 02:23:17,045][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-26 02:23:17,762][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-26 02:23:18,479][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-26 02:23:19,194][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-26 02:23:19,911][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-26 02:23:20,627][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-26 02:23:21,343][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-26 02:23:22,060][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. 
[2026-03-26 02:23:22,783][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.50%, Current % of VRAM taken: 41.83%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:46 [2026-03-26 02:23:23,999][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-26 02:23:24,002][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-26 02:23:24,003][mllm.training.trainer_common][INFO] - Saved trainer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2026_03/ipd_vanilla_ad_align_no_agent_buffer_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-26 02:23:25,384][__main__][INFO] - Iteration 754 took 55s (8.88% Gen, 88.63% Train). Generation: 4s, Training: 49s. Estimated remaining time: 3h 26m 50s. Estimated total time: 15h 25m 41s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 34s, 500 more iterations: 7h 42m 50s. [2026-03-26 02:23:25,388][__main__][INFO] - Starting iteration 754. [2026-03-26 02:23:25,393][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2026-03-26 02:23:25,394][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-26 02:23:30,517][__main__][INFO] - Number of regex retries in iteration 754: 0 [2026-03-26 02:23:30,518][__main__][INFO] - agents played in iteration 754 are Bob, Alice [2026-03-26 02:23:31,339][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 02:23:31,403][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.59%, Block Peak % of device VRAM: 19.41%, ΔTime: 00:00:00 [2026-03-26 02:23:31,404][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2026-03-26 02:23:31,405][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2026-03-26 02:23:32,093][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-26 02:23:32,740][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-26 02:23:33,456][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-26 02:23:34,170][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-26 02:23:34,884][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-26 02:23:35,599][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-26 02:23:36,313][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-26 02:23:37,028][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-26 02:23:37,744][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-26 02:23:38,457][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-26 02:23:39,176][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 
[2026-03-26 02:23:39,890][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-26 02:23:40,605][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-26 02:23:41,320][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-26 02:23:42,036][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-26 02:23:42,751][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-26 02:23:43,468][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-26 02:23:44,182][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-26 02:23:44,896][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-26 02:23:45,614][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-26 02:23:46,329][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-26 02:23:47,046][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-26 02:23:47,762][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-26 02:23:48,481][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-26 02:23:49,194][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-26 02:23:49,911][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-26 02:23:50,627][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-26 02:23:51,343][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-26 02:23:52,059][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-26 02:23:52,777][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-26 02:23:53,493][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-26 02:23:54,211][mllm.training.trainer_common][INFO] - 
Processing mini-batch 124 of 256 [2026-03-26 02:23:54,927][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-26 02:23:55,644][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-26 02:23:56,360][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-26 02:23:57,077][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-26 02:23:57,794][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-26 02:23:58,510][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-26 02:23:59,227][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-26 02:23:59,943][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-26 02:24:00,661][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-26 02:24:01,377][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-26 02:24:02,093][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-26 02:24:02,809][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-26 02:24:03,526][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-26 02:24:04,242][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-26 02:24:04,958][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256